2006 IntroToTheSpecIssOnTheWebAsCorpus

From GM-RKB

Subject Headings: Web Corpus, Corpus.

Notes

Cited By

Quotes

Abstract

The Web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists’ playground. This special issue of Computational Linguistics explores ways in which this dream is being explored.

1. Introduction

The Web is immense, free, and available by mouse click. It contains hundreds of billions of words of text and can be used for all manner of language research. The simplest language use is spell checking. Is it speculater or speculator? Google gives 67 for the former (usefully suggesting the latter might have been intended) and 82,000 for the latter. Question answered.
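The hit-count method sketched above can be reduced to a one-line decision rule: prefer the candidate spelling with more attestations. Below is a minimal sketch in Python; the counts are hypothetical stand-ins for search-engine hit counts (the function name `prefer_by_frequency` is ours, not from the paper).

```python
# Sketch of hit-count spell checking, as in the speculater/speculator
# example: prefer the variant with more attestations. The counts here
# are hypothetical stand-ins for search-engine hit counts.

def prefer_by_frequency(candidates, hit_counts):
    """Return the candidate spelling with the highest hit count."""
    return max(candidates, key=lambda w: hit_counts.get(w, 0))

# Hypothetical hit counts mirroring the figures quoted in the text.
hits = {"speculater": 67, "speculator": 82_000}
best = prefer_by_frequency(["speculater", "speculator"], hits)
print(best)  # speculator
```

The same rule generalizes to any confusion set, though raw hit counts are noisy: they conflate pages with occurrences and include the very misspellings one is trying to filter.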

Language scientists and technologists are increasingly turning to the Web as a source of language data, because it is so big, because it is the only available source for the type of language in which they are interested, or simply because it is free and instantly available. The mode of work has increased dramatically from a standing start seven years ago with the Web being used as a data source in a wide range of research activities: The papers in this special issue form a sample of the best of it. This introduction to the issue aims to survey the activities and explore recurring themes. We first consider whether the Web is indeed a corpus, then present a history of the theme in which we view the Web as a development of the empiricist turn that has brought corpora center stage in the course of the 1990s. We briefly survey the range of Web-based NLP research, then present estimates of the size of the Web, for English and for other languages, and a simple method for translating phrases. Next we open the Pandora’s box of representativeness (concluding that the Web is not representative of anything other than itself, but then neither are other corpora, and that more work needs to be done on text types). We then introduce the articles in the special issue and conclude with some thoughts on how the Web could be put at the linguist’s disposal rather more usefully than current search engines allow.

1.1 Is the Web a Corpus?

To establish whether the Web is a corpus we need to find out, discover, or decide what a corpus is. McEnery and Wilson (1996, page 21) say

  • In principle, any collection of more than one text can be called a corpus. . . But the term “corpus” when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition provides for. These may be considered under four main headings: sampling and representativeness, finite size, machine-readable form, a standard reference.

We would like to reclaim the term from the connotations. Many of the collections of texts that people use and refer to as their corpus, in a given linguistic, literary, or language-technology study, do not fit. A corpus comprising the complete published works of Jane Austen is not a sample, nor is it representative of anything else. Closer to home, Manning and Schütze (1999, page 120) observe:

  • In Statistical NLP, one commonly receives as a corpus a certain amount of data from a certain domain of interest, without having any say in how it is constructed. In such cases, having more training data is normally more useful than any concerns of balance, and one should simply use all the text that is available.

We wish to avoid a smuggling of values into the criterion for corpus-hood. McEnery and Wilson (following others before them) mix the question “What is a corpus?” with “What is a good corpus (for certain kinds of linguistic study)?” muddying the simple question “Is corpus x good for task y?” with the semantic question “Is x a corpus at all?” The semantic question then becomes a distraction, all too likely to absorb energies that would otherwise be addressed to the practical one. So that the semantic question may be set aside, the definition of corpus should be broad. We define a corpus simply as “a collection of texts.” If that seems too broad, the one qualification we allow relates to the domains and contexts in which the word is used rather than its denotation: A corpus is a collection of texts when considered as an object of language or literary study.

The answer to the question “Is the web a corpus?” is yes.

2. History

2.2 The 100M Words of the BNC

Another argument is made vividly by Banko and Brill (2001). They explore the performance of a number of machine learning algorithms (on a representative disambiguation task) as the size of the training corpus grows from a million to a billion words. All the algorithms steadily improve in performance, though the question “Which is best?” gets different answers for different data sizes. The moral: Performance improves with data size, and getting more data will make more difference than fine-tuning algorithms.
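The learning-curve effect Banko and Brill observed can be illustrated with a toy experiment: a bare count-based disambiguator, trained on growing slices of data, improves steadily as coverage of contexts grows. The sketch below is not their experiment; the confusion-set data and the model are invented purely to show the shape of the curve.

```python
# Toy sketch of the Banko-and-Brill-style observation: a very simple
# count-based disambiguator improves steadily as training data grows.
# The data here is synthetic; it only illustrates the learning-curve idea.

def train(examples):
    """Count label frequencies per context word."""
    counts = {}
    for word, label in examples:
        counts.setdefault(word, {}).setdefault(label, 0)
        counts[word][label] += 1
    return counts

def predict(counts, word, default="there"):
    if word in counts:
        return max(counts[word], key=counts[word].get)
    return default  # back off to a fixed guess for unseen contexts

# Synthetic confusion-set data: each context word deterministically
# selects "there" or "their".
vocab = [(f"w{i}", "there" if i % 2 else "their") for i in range(1000)]

def accuracy(n_train):
    model = train(vocab[:n_train])
    correct = sum(predict(model, w) == lab for w, lab in vocab)
    return correct / len(vocab)

curve = [accuracy(n) for n in (10, 100, 1000)]
assert curve[0] < curve[1] < curve[2]  # more data, better accuracy
```

With 10 training examples the model falls back to guessing on most contexts; with the full data it has seen every context and disambiguates perfectly, mirroring the moral that coverage, not cleverness, drives the gain.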



References

  • Agirre, Eneko, Olatz Ansa, Eduard Hovy and David Martinez. (2000). Enriching very large ontologies using the WWW. In Proceedings of the Ontology Learning Workshop of the European Conference of AI (ECAI), Berlin.
  • Atkins, Sue. (1993). Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41:5–72.
  • Atkins, Sue, Jeremy Clear, and Nicholas Ostler. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1):1–16.
  • Baayen, Harald. (2001). Word Frequency Distributions. Kluwer, Dordrecht.
  • Baker, Collin F., Charles J. Fillmore, and John B. Lowe. (1998). The Berkeley FrameNet Project. In: Proceedings of COLING-ACL, pages 86–90, Montreal, August.
  • (Banko & Brill, 2001) ⇒ Banko, Michele and Eric Brill. (2001). Scaling to very very large corpora for natural language disambiguation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics and the 10th Conference of the European Chapter of the Association for Computational Linguistics, Toulouse.
  • Beaudouin, Valérie, Serge Fleury, Benoît Habert, Gabriel Illouz, Christian Licoppe, and Marie Pasquier. (2001). Typweb: décrire la toile pour mieux comprendre les parcours. In Colloque International sur les Usages et les Services des Télécommunications (CIUST’01), Paris, June. Available at http://www.cavi.univ-paris3.fr/ilpga/ilpga/sfleury/typweb.htm.
  • Beesley, Kenneth R. (1988). Language identifier: A computer program for automatic natural-language identification of on-line text. In Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, pages 47–54, October 12–16.
  • Biber, Douglas. (1988). Variation across speech and writing. Cambridge University Press, Cambridge.
  • Biber, Douglas. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2):219–242.
  • Briscoe, Ted and John Carroll. (1997). Automatic extraction of subcategorization from corpora. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 356–363, Washington, DC, April.
  • Buitelaar, Paul and Bogdan Sacaleanu. (2001). Ranking and selecting synsets by domain relevance. In: Proceedings of the Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, NAACL, Pittsburgh, June.
  • Burnard, Lou. (1995). The BNC Reference Manual. Oxford University Computing Service, Oxford.
  • Cavaglia, Gabriela. (2002). Measuring corpus homogeneity using a range of measures for inter-document distance. In: Proceedings of the Third International Conference on Language Resources and Evaluation, pages 426–431, Las Palmas de Gran Canaria, Spain, May.
  • Church, Kenneth W. and Robert L. Mercer. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1–24.
  • Dumais, Susan, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. (2002). Web question answering: Is more always better? In: Proceedings of the 25th ACM SIGIR, pages 291–298, Tampere, Finland.
  • Fletcher, William. (2002). Facilitating compilation and dissemination of ad-hoc web corpora. In Teaching and Language Corpora 2002. Available at http://miniappolis.com/KWiCFinder/KWiCFinder.html.
  • Folch, Helka, Serge Heiden, Benoît Habert, Serge Fleury, Gabriel Illouz, Pierre Lafon, Julien Nioche, and Sophie Prévost. (2000). Typtex: Inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation. In: Proceedings of the Second Language Resources and Evaluation Conference, pages 141–148, Athens, May–June.
  • Fujii, Atsushi and Tetsuya Ishikawa. (2000). Utilizing the World Wide Web as an encyclopedia: Extracting term descriptions from semi-structured text. In Proceedings of the 38th Meeting of the ACL, pages 488–495, Hong Kong, October.
  • Gildea, Daniel. (2001). Corpus variation and parser performance. In: Proceedings of the Conference on Empirical Methods in NLP, Pittsburgh, PA.
  • Greenwood, Mark, Ian Roberts, and Robert Gaizauskas. (2002). University of Sheffield TREC 2002 Q & A system. In E. M. Voorhees and Lori P. Buckland, editors, The Eleventh Text Retrieval Conference (TREC-11), Washington. U.S. Government Printing Office.
  • Grefenstette, Gregory. (1995). Comparing two language identification schemes. In Proceedings of the Third International Conference on the Statistical Analysis of Textual Data (JADT’95), pages 263–268, Rome, December 11–13. Available at www.xrce.xerox.com/competencies/contentanalysis/publications/Documents/P49030/content/gg_aslib.pdf.
  • Grefenstette, Gregory. (1999). The WWW as a resource for example-based MT tasks. Paper presented at ASLIB “Translating and the Computer” conference, London, October.
  • Grefenstette, Gregory. (2001). Very large lexicons. In Walter Daelemans, Khalil Simaan, Jakub Zavrel, and Jorn Veenstra, editors, Computational Linguistics in the Netherlands 2000: Selected Papers from the Eleventh CLIN Meeting, Language and Computers 37. Rodopi, Amsterdam.
  • Grefenstette, Gregory and Julien Nioche. (2000). Estimation of English and non-English language use on the WWW. In: Proceedings of the RIAO (Recherche d’Informations Assistée par Ordinateur), pages 237–246, Paris.
  • Hawking, D., E. Voorhees, N. Craswell, and P. Bailey. (1999). Overview of the TREC8 Web track. In: Proceedings of the Eighth Text Retrieval Conference, Gaithersburg, Maryland, November.
  • Ipeirotis, Panagiotis G., Luis Gravano, and Mehran Sahami. (2001). Probe, count, and classify: Categorizing hidden Web databases. In: Proceedings of the SIGMOD Conference, Santa Barbara, CA.
  • Jones, Rosie and Rayid Ghani. (2000). Automatically building a corpus for a minority language from the Web. In Proceedings of the Student Workshop of the 38th Annual Meeting of the Association for Computational Linguistics, pages 29–36, Hong Kong.
  • Karlgren, Jussi and Douglass Cutting. (1994). Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of COLING-94, pages 1071–1075, Kyoto, Japan.
  • Kessler, Brett, Geoffrey Nunberg, and Hinrich Schütze. (1997). Automatic detection of text genre. In: Proceedings of ACL and EACL, pages 39–47, Madrid.
  • Kilgarriff, Adam. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1):1–37.
  • Kilgarriff, Adam. (2003). Linguistic search engine. In Kiril Simov, editor, Shallow Processing of Large Corpora: Workshop Held in Association with Corpus Linguistics 2003, Lancaster, England, March.
  • Kilgarriff, Adam and Michael Rundell. (2002). Lexical profiling software and its lexicographical applications — A case study. In: Proceedings of EURALEX ’02, Copenhagen, August.
  • Kittredge, Richard and John Lehrberger. (1982). Sublanguage: Studies of Language in Restricted Semantic Domains. De Gruyter, Berlin.
  • Korhonen, Anna. (2000). Using semantically motivated estimates to help subcategorization acquisition. In Proceedings of the Joint Conference on Empirical Methods in NLP and Very Large Corpora, pages 216–223, Hong Kong, October.
  • Lawrence, Steve and C. Lee Giles. (1999). Accessibility of information on the Web. Nature, 400:107–109.
  • Manning, Christopher and Hinrich Schütze. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
  • McEnery, Tony and Andrew Wilson. (1996). Corpus Linguistics. Edinburgh University Press, Edinburgh.
  • Mihalcea, Rada and Dan Moldovan. (1999). A method for word sense disambiguation of unrestricted text. In: Proceedings of the 37th Meeting of ACL, pages 152–158, College Park, MD, June.
  • Nagao, Makoto. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Alick Elithorn and Ranan Banerji, editors, Artificial and Human Intelligence. North-Holland, Edinburgh, pages 173–180.
  • Peters, Carol, editor. (2001). Cross-Language Information Retrieval and Evaluation, Workshop of Cross-Language Evaluation Forum (CLEF 2000) Lisbon, Portugal, September 21–22, 2000, Revised Papers. Lecture Notes in Computer Science. Springer-Verlag.
  • Resnik, Philip. (1999). Mining the Web for bilingual text. In: Proceedings of the 37th Meeting of ACL, pages 527–534, College Park, MD, June.
  • Rigau, German, Bernardo Magnini, Eneko Agirre, and John Carroll. (2002). Meaning: A roadmap to knowledge technologies. In Proceedings of COLING Workshop on A Roadmap for Computational Linguistics, Taipei, Taiwan.
  • Roland, Douglas, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder, and Chris Riddoch. (2000). Verb subcategorization frequency differences between business-news and balanced corpora: The role of verb sense. In Proceedings of the Workshop on Comparing Corpora, 38th ACL, Hong Kong, October.
  • Sekine, Satoshi. (1997). The domain dependence of parsing. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 96–102, Washington, DC, April.
  • Sinclair, John M., editor. (1987). Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins, London.
  • Varantola, Krista. (2000). Translators and disposable corpora. In: Proceedings of CULT (Corpus Use and Learning to Translate), Bertinoro, Italy, November.
  • Villaseñor-Pineda, L., M. Montes y Gómez, M. Pérez-Coutiño, and D. Vaufreydaz. (2003). A corpus balancing method for language model construction. In Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2003), pages 393–401, Mexico City, February.
  • Volk, Martin. (2001). Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In: Proceedings of Corpus Linguistics 2001, Lancaster, England.
  • Vossen, Piek. (2001). Extending, trimming and fusing WordNet for technical documents. In: Proceedings of the NAACL 2001 Workshop on WordNet and Other Lexical Resources, Pittsburgh, June. Available at http://engr.smu.edu/~rada/mwnw/papers/WNW-NACL-205.pdf.gz.
  • Wilks, Yorick. (2003). On the ownership of text. Computers and the Humanities. Forthcoming.
  • Xu, J. L. (2000). Multilingual search on the World Wide Web. In: Proceedings of the Hawaii International Conference on System Science (HICSS-33), Maui, Hawaii, January.
  • Zheng, Zhiping. (2002). AnswerBus question answering system. In E. M. Voorhees and Lori P. Buckland, editors, Proceedings of HLT Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27.

Author: Adam Kilgarriff, Gregory Grefenstette
Title: Introduction to the Special Issue on the Web as Corpus
Journal: Computational Linguistics (CL)
URL: http://ucrel.lancs.ac.uk/acl/J/J03/J03-3001.pdf
DOI: 10.1162/089120103322711569
Year: 2006