2004 TowardsTerascaleKnowledgeAcquisition

Subject Headings: Web-based Information Extraction, Is-A Relation.

Notes

Cited By

2004

Quotes

Abstract

Although vast amounts of textual data are freely available, many NLP algorithms exploit only a minute percentage of it. In this paper, we study the challenges of working at the terascale. We present an algorithm, designed for the terascale, for mining is-a relations that achieves similar performance to a state-of-the-art linguistically-rich method. We focus on the accuracy of these two systems as a function of processing time and corpus size.

Pattern-based approaches

Marti Hearst (1992) was the first to use a pattern-based approach to extract hyponym relations from a raw corpus. She used an iterative process to semi-automatically learn patterns. However, a corpus of 20MB words yielded only 400 examples. Our pattern-based algorithm is very similar to the one used by Hearst. She uses seed examples to manually discover her patterns, whereas we use a minimal edit distance algorithm to automatically discover the patterns.

Riloff and Shepherd (1997) used a semiautomatic method for discovering similar words using a few seed examples by using pattern-based techniques and human supervision. Berland and Charniak (1999) used similar pattern-based techniques and other heuristics to extract meronymy (part-whole) relations. They reported a precision of about 55% on a corpus of 100,000 words. Girju et al. (2003) improved upon Berland and Charniak’s work using a machine learning filter. Mann (2002) and Fleischman et al. (2003) used part of speech patterns to extract a subset of hyponym relations involving proper nouns.

Our pattern-based algorithm differs from these approaches in two ways. We learn lexico-POS patterns in an automatic way. Also, the patterns are learned with the specific goal of scaling to the terascale (see Table 2).

Scalable pattern-based approach

We propose an algorithm for learning highly scalable lexico-POS patterns. Given two sentences with their surface form and part of speech tags, the algorithm finds the optimal lexico-POS alignment. For example, consider the following 2 sentences:

  • 1) Platinum is a precious metal.
  • 2) Molybdenum is a metal.

Applying a POS tagger (Brill 1995) gives the following output:

 Surface  Platinum  is   a   precious  metal  .
 POS      NNP       VBZ  DT  JJ        NN     .

 Surface  Molybdenum  is   a   metal  .
 POS      NNP         VBZ  DT  NN     .

A very good pattern to generalize from the alignment of these two strings would be:

 Surface       is  a     metal  .
 POS      NNP                   .

We use the following notation to denote this alignment: "_NNP is a (*s*) metal.", where "_NNP" represents the POS tag NNP.
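The alignment of the two example sentences can be sketched with a standard minimal edit distance computation followed by a traceback. This is only an illustration: the cost values (`SAME_TAG_COST`, `DIFF_COST`, `GAP_COST`), the function names, and the rule that a substitution with differing tags becomes the one-word wildcard (*g*) are assumptions made here, not values or rules taken from the paper.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface word, POS tag)

# Assumed integer costs, chosen for illustration only.
SAME_TAG_COST = 1  # different words that share a POS tag
DIFF_COST = 2      # different words with different POS tags
GAP_COST = 2       # a word present in only one sentence


def sub_cost(a: Token, b: Token) -> int:
    """Cost of aligning token a with token b."""
    if a[0].lower() == b[0].lower():
        return 0
    return SAME_TAG_COST if a[1] == b[1] else DIFF_COST


def align(s1: List[Token], s2: List[Token]) -> List[str]:
    """Return a lexico-POS pattern generalizing two tagged sentences."""
    n, m = len(s1), len(s2)
    # dist[i][j] = minimal edit cost between s1[:i] and s2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + GAP_COST
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + GAP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
                dist[i - 1][j] + GAP_COST,
                dist[i][j - 1] + GAP_COST,
            )

    # Walk back from the bottom-right cell, emitting one pattern token per step.
    pattern: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1])):
            a, b = s1[i - 1], s2[j - 1]
            if a[0].lower() == b[0].lower():
                pattern.append(a[0])       # same word: keep the surface form
            elif a[1] == b[1]:
                pattern.append("_" + a[1])  # same tag: generalize to the POS tag
            else:
                pattern.append("(*g*)")    # assumed: unrelated words, one-word wildcard
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + GAP_COST:
            pattern.append("(*s*)")        # word only in s1: optional slot
            i -= 1
        else:
            pattern.append("(*s*)")        # word only in s2: optional slot
            j -= 1
    pattern.reverse()
    return pattern


platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]

print(" ".join(align(platinum, molybdenum)))  # _NNP is a (*s*) metal .
```

With these assumed costs, the only minimal-cost alignment pairs "Platinum" with "Molybdenum" (same tag) and leaves "precious" unmatched, which reproduces the pattern shown above.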

To perform such alignments we introduce two wildcard operators, skip (*s*) and wildcard (*g*). The skip operator represents 0 or 1 instance of any word (similar to the \w* pattern in Perl), while the wildcard operator represents exactly 1 instance of any word (similar to the \w+ pattern in Perl).
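To make the two operators concrete, the following minimal sketch matches a pattern against a POS-tagged sentence. The token representation and the `matches` function are hypothetical names introduced for illustration. The sketch also assumes that a pattern must cover the whole sentence and that literal words are compared case-insensitively, neither of which is stated in the text.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface word, POS tag)

SKIP = "(*s*)"      # 0 or 1 instance of any word
WILDCARD = "(*g*)"  # exactly 1 instance of any word


def matches(pattern: List[str], sentence: List[Token]) -> bool:
    """Return True if the pattern covers the entire tagged sentence."""
    if not pattern:
        return not sentence
    head, rest = pattern[0], pattern[1:]
    if head == SKIP:
        # Either consume nothing, or consume exactly one token.
        return matches(rest, sentence) or (
            bool(sentence) and matches(rest, sentence[1:])
        )
    if not sentence:
        return False
    word, tag = sentence[0]
    if head == WILDCARD:
        ok = True
    elif head.startswith("_"):
        ok = tag == head[1:]               # "_NNP" matches any word tagged NNP
    else:
        ok = word.lower() == head.lower()  # literal surface word
    return ok and matches(rest, sentence[1:])


pattern = "_NNP is a (*s*) metal .".split()

platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]
gold = [("Gold", "NNP"), ("is", "VBZ"), ("a", "DT"), ("very", "RB"),
        ("precious", "JJ"), ("metal", "NN"), (".", ".")]

print(matches(pattern, platinum))    # True: (*s*) absorbs "precious"
print(matches(pattern, molybdenum))  # True: (*s*) matches nothing
print(matches(pattern, gold))        # False: two words sit where (*s*) allows at most one
```

Replacing (*s*) with (*g*) in the same pattern would still accept the Platinum sentence but reject the Molybdenum one, since the wildcard requires exactly one word in that slot.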

References

  • Banko, M. and Brill, E. (2001). Mitigating the paucity of data problem. In: Proceedings of HLT-2001. San Diego, CA.
  • Berland, M. and Eugene Charniak (1999). Finding parts in very large corpora. In ACL-1999. pp. 57–64. College Park, MD.
  • Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543–566.
  • Brill, E.; Lin, J.; Banko, M.; Dumais, S.; and Ng, A. (2001). Data-intensive question answering. In: Proceedings of the TREC-10 Conference, pp 183–189. Gaithersburg, MD.
  • Caraballo, S. (1999). Automatic acquisition of a hypernym-labeled noun hierarchy from text. In: Proceedings of ACL-99. pp 120–126, Baltimore, MD.
  • Curran, J. and Moens, M. (2002). Scaling context space. In: Proceedings of ACL-02. pp 231–238, Philadelphia, PA.
  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
  • Oren Etzioni; Cafarella, M.; Downey, D.; Kok, S.; Popescu, A.M.; Shaked, T.; Soderland, S.; Weld, D. S.; and Yates, A. (2004). Web-scale information extraction in KnowItAll (Preliminary Results). To appear in the Conference on WWW.
  • Fleischman, M.; Eduard Hovy; and Echihabi, A. (2003). Offline strategies for online question answering: Answering questions before they are asked. In: Proceedings of ACL-03. pp. 1–7. Sapporo, Japan.
  • Girju, R.; Badulescu, A.; and Dan Moldovan (2003). Learning semantic constraints for the automatic discovery of part-whole relations. In: Proceedings of HLT/NAACL-03. pp. 80–87. Edmonton, Canada.
  • Harris, Z. (1985). Distributional structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26–47.
  • Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In COLING-92. pp. 539–545. Nantes, France.
  • Hindle, D. (1990). Noun classification from predicate-argument structures. In: Proceedings of ACL-90. pp. 268–275. Pittsburgh, PA.
  • Dekang Lin (1994). Principar - an efficient, broad-coverage, principle-based parser. Proceedings of COLING-94. pp. 42–48. Kyoto, Japan.
  • Dekang Lin (1998). Automatic retrieval and clustering of similar words. In: Proceedings of COLING/ACL-98. pp. 768–774. Montreal, Canada.
  • Mann, G. S. (2002). Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet'02: Building and Using Semantic Networks, Taipei, Taiwan.
  • George A. Miller (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).
  • Och, F.J. and Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of ACL. pp. 295–302. Philadelphia, PA.
  • Patrick Pantel and Dekang Lin (2002). Discovering Word Senses from Text. In: Proceedings of SIGKDD-02. pp. 613–619. Edmonton, Canada.
  • Patrick Pantel and Ravichandran, D. (2004). Automatically labeling semantic classes. In: Proceedings of HLT/NAACL-04. pp. 321–328. Boston, MA.
  • Ellen Riloff and Shepherd, J. (1997). A corpus-based approach for building semantic lexicons. In: Proceedings of EMNLP-1997.
  • Ellen Voorhees. (2003). Overview of the question answering track. In: Proceedings of TREC-12 Conference. NIST, Gaithersburg, MD.

Author: Eduard Hovy, Patrick Pantel, Deepak Ravichandran
Title: Towards Terascale Knowledge Acquisition
URL: http://www.isi.edu/natural-language/people/ravichan/papers/coling04.pdf
DOI: 10.3115/1220355.1220466
Year: 2004