2008 InformationExtractionFromWikipedia


Subject Headings:


Notes

  • Goal / vision
    • Every Wikipedia article describes one entity from a class that is defined by the type of its infobox (if present). The goal is the unsupervised conversion of Wikipedia into a structured format.
  • Challenges
    • For many entity types there are not enough articles / entities, and many articles are too short to extract much information from.
  • Problem definition
    • Given a Wikipedia page / article, identify its infobox / entity class and extract as many attribute values (of that infobox) as possible.
  • Approach
    • A document classifier is trained to identify the infobox / entity class.
    • A sentence classifier is trained to predict which attribute values are contained in a particular sentence of an article belonging to a given infobox class.
    • Finally, an attribute extractor is learned to extract the actual attribute values from the sentences predicted to contain them (see the pipeline sketch at the end of these notes).
  • Use of infoboxes
    • In the training phase, infobox types are used to define training data for the document classifier and infobox attribute values are used to define training data for the sentence classifier.
    • In the test phase, infoboxes are ignored, i.e. document classification and attribute extraction use only the article text as input.
  • Shrinkage Method
    • Shrinkage is a general statistical method for improving estimators when training data is limited.
    • In this paper, they apply shrinkage as follows.
      • They search upwards and downwards in the ontology of infoboxes to aggregate training data from related classes (see the shrinkage sketch at the end of these notes).
  • Extracting from the web
    • Many attribute values do not appear in the text of the article.
    • In order to improve the recall of attribute extraction, they apply the extractors trained from Wikipedia to other web pages.
    • Challenge: maintaining the precision of the extractors on lower-quality non-Wikipedia pages.
  • Discussion
    • From checking a random sample of Wikipedia pages, I have the impression that for many, if not most, of the attributes the values do not appear in the text (but only in the infobox). There seems to be a great need for considering non-Wikipedia pages.
    • On the other hand, for some attributes multiple values appear in the corresponding article, e.g. the headquarters of a company that has changed over time. It seems hard to pick the relevant value automatically.
    • While Wikipedia data does not have much commercial value, it has the advantage of providing a lot of ground truth, e.g. lists of entities of a particular type and infoboxes with correct attribute values. It also contains an ontology over the collection of infoboxes. Can / should we use Wikipedia to obtain training and test cases for our second application domain, or at least to define ground truth there?
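
A minimal sketch of the three-stage pipeline described above, written in Python with scikit-learn. All article texts, class names, and attribute names are invented for illustration, and the regex in the last stage merely stands in for the learned extractors used in the paper:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    import re

    # Stage 1: document classifier -- article text -> infobox class.
    # During training, the label comes from the article's existing infobox type.
    articles = [
        "Acme Corp is a company headquartered in Boston, founded in 1990.",
        "Springfield is a city on the river Foo with 50,000 inhabitants.",
    ]
    doc_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    doc_clf.fit(articles, ["company", "settlement"])

    # Stage 2: sentence classifier (shown for the "company" class) -- which
    # attribute, if any, does a sentence express? Training labels are derived
    # by matching known infobox attribute values against article sentences.
    sentences = [
        "Acme Corp is headquartered in Boston.",
        "It was founded in 1990.",
        "The company makes widgets.",
    ]
    sent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    sent_clf.fit(sentences, ["headquarters", "founded", "none"])

    # Stage 3: attribute extractor -- pull the value out of a sentence
    # predicted to contain it (a learned sequence model in the paper).
    def extract(sentence, attribute):
        if attribute == "founded":
            m = re.search(r"\b(1[89]\d\d|20\d\d)\b", sentence)
            return m.group(1) if m else None
        return None

    # Test phase: infoboxes are ignored; only the raw article text is used.
    article = "FooSoft was founded in 2003. It makes databases."
    cls = doc_clf.predict([article])[0]
    for sent in article.split(". "):
        attr = sent_clf.predict([sent])[0]
        if attr != "none":
            print(cls, attr, extract(sent, attr))

Note how the test phase consults only the raw text, mirroring the point above that infoboxes are used solely to generate training labels.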
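
A small sketch of the shrinkage step, under the same caveats: when an infobox class has too few training articles, borrow down-weighted examples from its ancestors and descendants in the infobox ontology. The ontology fragment, the min_size threshold, and the distance-based weights are all assumptions for illustration; the paper's actual weighting scheme may differ:

    # child -> parent links in a made-up fragment of the infobox ontology
    ontology = {"actor": "performer", "comedian": "performer", "performer": "person"}

    def related_classes(cls):
        """Yield (class, distance) for ancestors and direct subclasses of cls."""
        dist, cur = 1, ontology.get(cls)
        while cur is not None:                      # walk upwards to ancestors
            yield cur, dist
            cur, dist = ontology.get(cur), dist + 1
        for child, parent in ontology.items():      # walk one level downwards
            if parent == cls:
                yield child, 1

    def shrinkage_training_set(cls, examples_by_class, min_size=100):
        """Aggregate weighted training examples for cls from related classes."""
        data = [(x, 1.0) for x in examples_by_class.get(cls, [])]
        if len(data) >= min_size:
            return data                              # enough data, no shrinkage
        for rel, dist in related_classes(cls):
            weight = 1.0 / (1 + dist)                # assumed distance decay
            data += [(x, weight) for x in examples_by_class.get(rel, [])]
        return data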

Cited By


Quotes

Abstract

