2004 TheAutomaticContentExtractionACEProgram


Subject Headings: ACE Program

Quotes

Abstract

The objective of the ACE program is to develop technology to automatically infer from human language data the entities being mentioned, the relations among these entities that are directly expressed, and the events in which these entities participate. Data sources include audio and image data in addition to pure text, and Arabic and Chinese in addition to English. The effort involves defining the research tasks in detail, collecting and annotating data needed for training, development, and evaluation, and supporting the research with evaluation tools and research workshops. This program began with a pilot study in 1999. The next evaluation is scheduled for September 2004.

1. Introduction and Background

Today’s global web of electronic information, including most notably the World Wide Web, provides a resource of unbounded information-bearing potential. But fully exploiting this potential requires the ability to extract content from human language automatically. That is the objective of the ACE program: to develop the capability to extract meaning from multimedia sources. These sources include text, audio and image data. [Footnote 1: While the ACE program is directed toward extraction of information from audio and image sources in addition to pure text, the research effort is restricted to information extraction from text. The actual transduction of audio and image data into text is not part of the ACE research effort, although the processing of ASR and OCR output from such transducers is.] The ACE program is a “technocentric” research effort, meaning that the emphasis is on developing core enabling technologies rather than solving the application needs that motivate the research.

The program began in 1999 with a study intended to identify the key content extraction tasks that would serve as the research targets for the remainder of the program. These tasks were identified in general as the extraction of the entities, relations and events being discussed in the language. In its general objective, the ACE program is motivated by and addresses the same issues as the MUC program that preceded it (NIST 1999). The ACE program, however, attempts to take the task “off the page” in the sense that the research objectives are defined in terms of the target objects (i.e., the entities, the relations, and the events) rather than in terms of the words in the text. For example, the so-called “named entity” task, as defined in MUC, is to identify those words (on the page) that are names of entities. In ACE, on the other hand, the corresponding task is to identify the entity so named. This is a different task, one that is more abstract and that involves inference more explicitly in producing an answer. In a real sense, the task is to detect things that “aren’t there”. Reference resolution thus becomes an integral and critical part of solving the problem.

During the period 2000-2001, the ACE effort was devoted solely to entity detection and tracking. During the period 2002-2003, relations were explored and added. Now, starting in 2004, events are being explored and added as the third of the three original tasks.

2. Task Definitions

The Automatic Content Extraction (ACE) program, a new effort to stimulate and benchmark research in information extraction, presents four challenges:

  1. Recognition of entities, not just names. In the ACE entity detection and tracking (EDT) task, all mentions of an entity, whether a name, a description, or a pronoun, are to be found and collected into equivalence classes based on reference to the same entity. Therefore, practical co-reference resolution is fundamental.
  2. Recognition of relations. The relation detection and characterization task (RDC) requires detection and characterization of relations between (pairs of) entities. There are five general types of relations, some of which are further sub-divided, yielding a total of 24 types/subtypes of relations:
    • Role, the role a person plays in an organization, which can be subtyped as Management, General-Staff, Member, Owner, Founder, Client, Affiliate-Partner, Citizen-Of, or Other,
    • Part, i.e., part-whole relationships, subtyped as Subsidiary, Part-Of, or Other,
    • At, location relationships, which can be subtyped Located, Based-In, or Residence,
    • Near, to identify relative locations, and
    • Social, subtyped as Parent, Sibling, Spouse, Grandparent, Other-Relative, Other-Personal, Associate, or Other-Professional.
  3. Event extraction. Though not in any previous ACE evaluation, event detection and characterization is planned for the 2004 evaluation (August-September, 2004). Details of the task definition, annotation guidelines, and scoring are being worked out at the time of writing this paper.
  4. Extraction is measured not merely on text, but also on speech and on OCR input. Moving beyond name finding is a crucial leap for modalities other than text, since the ability to relate two strings (as in ACE) in very noisy input may degrade much more than finding strings in isolation (as in named entity recognition). Furthermore, the lack of case and punctuation, including the lack of sentence boundary markers, poses a challenge to full parsing of speech.
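
The relation inventory in item 2 can be written down as a small lookup table, which also makes the stated total of 24 types/subtypes easy to check. The following is only a sketch: the names RELATION_SUBTYPES, count_labels and is_valid are illustrative and do not come from the ACE specification, and counting Near as a single label (it has no listed subtypes) is an assumption about how the total is reached.

```python
# Relation types and subtypes as listed in item 2 above.
# The dict layout and all names here are illustrative, not an ACE format.
RELATION_SUBTYPES = {
    "Role": ["Management", "General-Staff", "Member", "Owner", "Founder",
             "Client", "Affiliate-Partner", "Citizen-Of", "Other"],
    "Part": ["Subsidiary", "Part-Of", "Other"],
    "At": ["Located", "Based-In", "Residence"],
    "Near": [],  # no subtypes listed; assumed to count as one label
    "Social": ["Parent", "Sibling", "Spouse", "Grandparent", "Other-Relative",
               "Other-Personal", "Associate", "Other-Professional"],
}


def count_labels(taxonomy):
    """Count one label per subtype, or one per type without subtypes."""
    return sum(len(subs) if subs else 1 for subs in taxonomy.values())


def is_valid(rel_type, subtype=None):
    """Check that a (type, subtype) pair is consistent with the table."""
    if rel_type not in RELATION_SUBTYPES:
        return False
    subs = RELATION_SUBTYPES[rel_type]
    if subtype is None:
        return not subs  # a bare type is complete only if it has no subtypes
    return subtype in subs


print(count_labels(RELATION_SUBTYPES))
print(is_valid("Role", "Founder"), is_valid("At", "Founder"))
```

Under these assumptions the count comes to 9 + 3 + 3 + 1 + 8 = 24, matching the figure given in the text.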

Data Representation

The ACE research targets, namely entities, relations, and events, are represented in terms of their underlying attributes and constituents. This information is output in XML format, by both LDC annotators and system developers, according to an “apf” document type definition (LDC 2004). For entities, there is a direct connection with the source data, in terms of the “mentions” of the entity. The identity of the entity is inferred from these mentions and from the entity attributes. The entity attributes are the type (person, organization, geo-political, location, facility, vehicle, weapon) and subtype of the entity, the entity class (specific, generic), and the name(s) of the entity that appear in the source data.
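
As a rough illustration of this representation, an entity can be modeled as a record holding its attributes plus the mentions that tie it to the source data. This is a sketch in Python, not the “apf” XML format itself: the class names, field names, and the character-offset convention are all assumptions made for the example, and the sample sentence is invented.

```python
from dataclasses import dataclass, field
from typing import List

# Attribute values taken from the paragraph above.
ENTITY_TYPES = {"person", "organization", "geo-political", "location",
                "facility", "vehicle", "weapon"}
ENTITY_CLASSES = {"specific", "generic"}


@dataclass
class Mention:
    start: int  # offset of the first character (hypothetical convention)
    end: int    # offset one past the last character
    text: str   # the mention string as it appears in the source


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    subtype: str
    entity_class: str
    names: List[str] = field(default_factory=list)
    mentions: List[Mention] = field(default_factory=list)

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.entity_class not in ENTITY_CLASSES:
            raise ValueError(f"unknown entity class: {self.entity_class}")


def make_mention(source: str, start: int, end: int) -> Mention:
    """Build a mention whose text is sliced directly from the source."""
    return Mention(start, end, source[start:end])


source = "Acme Corp. said it will expand."  # invented example sentence
acme = Entity(
    entity_id="E1",
    entity_type="organization",
    subtype="Commercial",
    entity_class="specific",
    names=["Acme Corp."],
    mentions=[make_mention(source, 0, 10), make_mention(source, 16, 18)],
)
print([m.text for m in acme.mentions])
```

The point of the sketch is that the name and the pronoun are two mentions attached to one entity record, so the entity, not the string, is the unit of output.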

Relations are represented in terms of their attributes and their (two) arguments. The arguments are the ACE entities that are related by the relation. The attributes are the relation type and subtype.

Events are represented in terms of their attributes and their participants. The participants are the ACE entities that participate in the event. ACE events are in essence a generalization of ACE relations. An ACE event can have a number of participants, and each participant is characterized by a role that it plays in the event (agent, object, source, target). Currently the event attributes are event type (destroy, create, transfer, move, interact) and event modality (real, not real).
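
The difference between the two representations can be sketched as follows: a relation has exactly two entity arguments, while an event has any number of participants, each with a role. The class and field names below are hypothetical, entities are referred to by invented ids such as "E1", and the example values are made up; only the attribute vocabularies come from the text.

```python
from dataclasses import dataclass
from typing import List

# Attribute values taken from the paragraph above.
EVENT_TYPES = {"destroy", "create", "transfer", "move", "interact"}
EVENT_MODALITIES = {"real", "not real"}
PARTICIPANT_ROLES = {"agent", "object", "source", "target"}


@dataclass
class Relation:
    rel_type: str
    subtype: str
    arg1: str  # id of the first entity argument
    arg2: str  # id of the second entity argument


@dataclass
class Participant:
    entity_id: str
    role: str


@dataclass
class Event:
    event_type: str
    modality: str
    participants: List[Participant]

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")
        if self.modality not in EVENT_MODALITIES:
            raise ValueError(f"unknown modality: {self.modality}")
        for p in self.participants:
            if p.role not in PARTICIPANT_ROLES:
                raise ValueError(f"unknown role: {p.role}")


# A relation always links exactly two entities.
employment = Relation("Role", "Management", arg1="E1", arg2="E2")

# An event may have more than two participants, each with its own role.
shipment = Event("transfer", "real", [
    Participant("E2", "agent"),
    Participant("E3", "object"),
    Participant("E4", "target"),
])
print(len(shipment.participants))
```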

Data Annotation

Under the ACE (NIST 2003) and DARPA TIDES (TIDES 2004) Programs, the Linguistic Data Consortium at the University of Pennsylvania develops annotation guidelines, corpora and other linguistic resources to support information extraction research (LDC 2004). LDC's ACE annotators tag broadcast transcripts, newswire and newspaper data in English, Chinese and Arabic, producing both training and test data for common research task evaluations.

Annotation Tasks

There are three primary ACE annotation tasks corresponding to the three research tasks: Entity Detection and Tracking (EDT), Relation Detection and Characterization (RDC), and Event Detection and Characterization (VDC). A fourth annotation task, Entity Linking (LNK), establishes co-reference between entity mentions.

EDT is the core annotation task, providing the foundation for all remaining tasks. The current ACE task identifies seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type is further divided into subtypes (for instance, Organization subtypes include Government, Commercial, Educational, Non-profit, Other). Annotators tag all mentions of each entity within a document, whether named, nominal or pronominal. For every mention, the annotator identifies the maximal extent of the string that represents the entity and labels the head of each mention. Nested mentions are also captured. Each entity is classified according to its type and subtype. Each entity mention is further tagged according to its class: specific, generic, attributive, negatively quantified or underspecified. During the LNK annotation task, annotators review the entire document to group mentions of the same entity together; they also label cases of metonymy, where the name of one entity is used to refer to another entity (or entities) related to it.
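
Grouping mentions of the same entity amounts to forming equivalence classes over mentions. One simple way to compute such classes from pairwise "same entity" judgments is a union-find structure, sketched below. This illustrates the data structure only; it is not a description of how LDC annotators or ACE systems perform linking, and the function name, mention strings and links are invented for the example.

```python
def group_mentions(mentions, links):
    """Merge mentions into equivalence classes given pairwise coreference links.

    mentions: list of unique mention keys (real data would use mention ids,
              since the same string can occur more than once in a document).
    links:    iterable of (a, b) pairs judged to refer to the same entity.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the class representative,
        # shortening the path along the way (path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


mentions = ["Acme Corp.", "the company", "it", "Boston", "the city"]
links = [("Acme Corp.", "the company"), ("the company", "it"),
         ("Boston", "the city")]
for group in group_mentions(mentions, links):
    print(group)
```

Because the links are treated as transitive, the name, the nominal and the pronoun end up in one class even though no link directly connects the name to the pronoun.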

During RDC tagging, annotators identify relations that exist between the entities tagged during the EDT task. There are five relation types in ACE: Role, Part, At, Near, and Social. The Role relation links a person to an organization that they belong to, own, founded, or provide some service to. The Part relation indicates part-whole relationships, such as a state to a nation, or a subsidiary to its parent company. The At relation indicates where a person or organization is located. The Near relation indicates the proximity of one location to another. The Social relation links two people in personal, familial or professional relationships. For each type there is a set of possible subtypes. Every relation takes two primary arguments: the two entities that it links. Relations that are supported by explicit textual evidence are distinguished from those that depend on contextual inference on the part of the reader. For explicit relations annotators also identify any temporal attributes. Annotators do not include relationships dependent on a reader's knowledge of the world. All relations are based on textual or contextual evidence found within the scope of the document.

Bibliography

  • LDC, 2004, Automatic Content Extraction [www.ldc.upenn.edu/Projects/ACE/]
  • NIST, 1999, Message Understanding Conference [www.itl.nist.gov/iaui/894.02/related_projects/muc/]
  • NIST, 2003, Automatic Content Extraction [www.nist.gov/speech/tests/ace]
  • NIST, 2004, Automatic Content Extraction [www.nist.gov/speech/tests/ace/ace04]
  • TIDES, 2004, DARPA Program in Translingual Information Detection Extraction and Summarization [www.darpa.mil/ipto/programs/tides/index.htm]


Citation

  • Authors: Ralph Weischedel, George Doddington, Alexis Mitchell, Mark Przybocki, Stephanie Strassel, Lance A. Ramshaw
  • Title: The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation
  • URL: http://papers.ldc.upenn.edu/LREC2004/ACE.pdf
  • Year: 2004