2004 TowardsTerascaleKnowledgeAcquisition

(Pantel et al., 2004) ⇒ Patrick Pantel, Deepak Ravichandran, Eduard Hovy. (2004). “Towards Terascale Knowledge Acquisition.” In: Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004). doi:10.3115/1220355.1220466

Subject Headings: Web-based Information Extraction, Is-A Relation.

Notes

Cited By

~70 http://scholar.google.com/scholar? …

2004

(Ravichandran et al., 2004) ⇒ Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. (2004). “The Terascale Challenge.” In: Proceedings of KDD Workshop on Mining for and from the Semantic Web (MSW-04).

Quotes

Abstract

Although vast amounts of textual data are freely available, many NLP algorithms exploit only a minute percentage of it. In this paper, we study the challenges of working at the terascale. We present an algorithm, designed for the terascale, for mining is-a relations that achieves similar performance to a state-of-the-art linguistically-rich method. We focus on the accuracy of these two systems as a function of processing time and corpus size.

…

Pattern-based approaches

Marti Hearst (1992) was the first to use a pattern-based approach to extract hyponym relations from a raw corpus. She used an iterative process to semi-automatically learn patterns. However, a corpus of 20MB words yielded only 400 examples. Our pattern-based algorithm is very similar to the one used by Hearst. She uses seed examples to manually discover her patterns, whereas we use a minimal edit distance algorithm to automatically discover the patterns.

Riloff and Shepherd (1997) used a semiautomatic method for discovering similar words using a few seed examples by using pattern-based techniques and human supervision. Berland and Charniak (1999) used similar pattern-based techniques and other heuristics to extract meronymy (part-whole) relations. They reported an accuracy of about 55% precision on a corpus of 100,000 words. Girju et al. (2003) improved upon Berland and Charniak’s work using a machine learning filter. Mann (2002) and Fleischman et al. (2003) used part of speech patterns to extract a subset of hyponym relations involving proper nouns.

Our pattern-based algorithm differs from these approaches in two ways. We learn lexico-POS patterns in an automatic way. Also, the patterns are learned with the specific goal of scaling to the terascale (see Table 2).

Scalable pattern-based approach

We propose an algorithm for learning highly scalable lexico-POS patterns. Given two sentences with their surface form and part of speech tags, the algorithm finds the optimal lexico-POS alignment. For example, consider the following 2 sentences:

1) Platinum is a precious metal.
2) Molybdenum is a metal.

Applying a POS tagger (Brill 1995) gives the following output:

Surface   Platinum     is    a    precious   metal   .
POS       NNP          VBZ   DT   JJ         NN      .

Surface   Molybdenum   is    a    metal   .
POS       NNP          VBZ   DT   NN      .

A very good pattern to generalize from the alignment of these two strings would be:

Surface                is    a    metal   .
POS       NNP                             .

We use the following notation to denote this alignment: "_NNP is a (*s*) metal.", where "_NNP" represents the POS tag NNP.
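The alignment above can be illustrated with a small edit-distance sketch in Python. This is not the authors' implementation: the cost values (0 for a surface match, 0.5 for a POS-only match, 1 for anything else), the function name `align`, and the choice to emit (*g*) for a full substitution are assumptions made for illustration only. The (*s*) and (*g*) operators themselves are defined in the paragraph that follows.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)


def _sub_cost(x: Token, y: Token) -> float:
    """Hypothetical substitution costs; the excerpt does not give the real ones."""
    if x[0].lower() == y[0].lower():
        return 0.0   # same surface word
    if x[1] == y[1]:
        return 0.5   # different word, same POS tag
    return 1.0       # nothing in common


def align(a: List[Token], b: List[Token]) -> List[str]:
    """Align two tagged sentences and return a lexico-POS pattern."""
    n, m = len(a), len(b)
    # dist[i][j] = minimal cost of aligning a[:i] with b[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + _sub_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + 1.0,   # token only in a
                dist[i][j - 1] + 1.0,   # token only in b
            )

    # Walk back from the end and emit one pattern element per step.
    pattern: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + _sub_cost(a[i - 1], b[j - 1]):
            x, y = a[i - 1], b[j - 1]
            if x[0].lower() == y[0].lower():
                pattern.append(x[0])        # keep the shared word
            elif x[1] == y[1]:
                pattern.append("_" + x[1])  # generalize to the shared POS tag
            else:
                pattern.append("(*g*)")     # exactly one arbitrary word
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1.0:
            pattern.append("(*s*)")         # optional word, present only in a
            i -= 1
        else:
            pattern.append("(*s*)")         # optional word, present only in b
            j -= 1
    pattern.reverse()
    return pattern


platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]

pattern = align(platinum, molybdenum)
print(" ".join(pattern))
```

Under these assumed costs, the two example sentences align to the elements `_NNP`, `is`, `a`, `(*s*)`, `metal`, `.`, which corresponds to the pattern given in the text.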

To perform such alignments we introduce two wildcard operators, skip (*s*) and wildcard (*g*). The skip operator represents 0 or 1 instance of any word (similar to the \w* pattern in Perl), while the wildcard operator represents exactly 1 instance of any word (similar to the \w+ pattern in Perl).
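As a rough illustration of how the two operators behave when a learned pattern is applied to a tagged sentence, the following Python sketch matches a pattern token by token. The function name `matches`, the list-of-strings pattern representation, the case-insensitive word comparison, and the requirement that the pattern cover the whole sentence are all assumptions for this example and are not taken from the paper.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)


def matches(pattern: List[str], tokens: List[Token]) -> bool:
    """Return True if the pattern covers the whole tagged sentence.

    Pattern elements (hypothetical encoding):
      "(*s*)"  skip: zero or one token of any kind
      "(*g*)"  wildcard: exactly one token of any kind
      "_TAG"   one token whose POS tag is TAG
      other    one token whose surface form equals the element
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "(*s*)":
        # Try consuming nothing first, then exactly one token.
        return matches(rest, tokens) or (bool(tokens) and matches(rest, tokens[1:]))
    if not tokens:
        return False
    surface, pos = tokens[0]
    if head == "(*g*)":
        ok = True
    elif head.startswith("_"):
        ok = pos == head[1:]
    else:
        ok = surface.lower() == head.lower()
    return ok and matches(rest, tokens[1:])


skip_pattern = ["_NNP", "is", "a", "(*s*)", "metal", "."]
wild_pattern = ["_NNP", "is", "a", "(*g*)", "metal", "."]

platinum = [("Platinum", "NNP"), ("is", "VBZ"), ("a", "DT"),
            ("precious", "JJ"), ("metal", "NN"), (".", ".")]
molybdenum = [("Molybdenum", "NNP"), ("is", "VBZ"), ("a", "DT"),
              ("metal", "NN"), (".", ".")]

print(matches(skip_pattern, platinum), matches(skip_pattern, molybdenum))
print(matches(wild_pattern, platinum), matches(wild_pattern, molybdenum))
```

With the skip operator both example sentences match. With the wildcard operator only the Platinum sentence matches, because the wildcard must consume exactly one word between "a" and "metal".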

…

References

Banko, M. and Brill, E. (2001). Mitigating the paucity of data problem. In: Proceedings of HLT-2001. San Diego, CA.
Berland, M. and Eugene Charniak (1999). Finding parts in very large corpora. In: Proceedings of ACL-1999. pp. 57-64. College Park, MD.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-566.
Brill, E.; Lin, J.; Banko, M.; Dumais, S.; and Ng, A. (2001). Data-intensive question answering. In: Proceedings of the TREC-10 Conference. pp. 183-189. Gaithersburg, MD.
Caraballo, S. (1999). Automatic acquisition of a hypernym-labeled noun hierarchy from text. In: Proceedings of ACL-99. pp. 120-126. Baltimore, MD.
Curran, J. and Moens, M. (2002). Scaling context space. In: Proceedings of ACL-02. pp. 231-238. Philadelphia, PA.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
Oren Etzioni; Cafarella, M.; Downey, D.; Kok, S.; Popescu, A.M.; Shaked, T.; Soderland, S.; Weld, D. S.; and Yates, A. (2004). Web-scale information extraction in KnowItAll (Preliminary Results). To appear in the Conference on WWW.
Fleischman, M.; Eduard Hovy; and Echihabi, A. (2003). Offline strategies for online question answering: Answering questions before they are asked. In: Proceedings of ACL-03. pp. 1-7. Sapporo, Japan.
Girju, R.; Badulescu, A.; and Dan Moldovan (2003). Learning semantic constraints for the automatic discovery of part-whole relations. In: Proceedings of HLT/NAACL-03. pp. 80-87. Edmonton, Canada.
Harris, Z. (1985). Distributional structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In: Proceedings of COLING-92. pp. 539-545. Nantes, France.
Hindle, D. (1990). Noun classification from predicate-argument structures. In: Proceedings of ACL-90. pp. 268-275. Pittsburgh, PA.
Dekang Lin (1994). Principar - an efficient, broad-coverage, principle-based parser. In: Proceedings of COLING-94. pp. 42-48. Kyoto, Japan.
Dekang Lin (1998). Automatic retrieval and clustering of similar words. In: Proceedings of COLING/ACL-98. pp. 768-774. Montreal, Canada.
Mann, G. S. (2002). Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet 02: Building and Using Semantic Networks, Taipei, Taiwan.
George A. Miller (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).
Och, F.J. and Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of ACL. pp. 295-302. Philadelphia, PA.
Patrick Pantel and Dekang Lin (2002). Discovering Word Senses from Text. In: Proceedings of SIGKDD-02. pp. 613-619. Edmonton, Canada.
Patrick Pantel and Ravichandran, D. (2004). Automatically labeling semantic classes. In: Proceedings of HLT/NAACL-04. pp. 321-328. Boston, MA.
Ellen Riloff and Shepherd, J. (1997). A corpus-based approach for building semantic lexicons. In: Proceedings of EMNLP-1997.
Ellen Voorhees. (2003). Overview of the question answering track. In: Proceedings of TREC-12 Conference. NIST, Gaithersburg, MD.


Author: Eduard Hovy, Patrick Pantel, Deepak Ravichandran
Title: Towards Terascale Knowledge Acquisition
Title URL: http://www.isi.edu/natural-language/people/ravichan/papers/coling04.pdf
DOI: 10.3115/1220355.1220466