Annotated Text Corpus


An annotated text corpus is a text corpus whose documents carry annotations, such as part-of-speech tags or parse structure.
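
As a rough illustration of what an annotated document can look like in practice, the sketch below reads one sentence in a simple `word/TAG` format. The format, the tag names, and the `Token` and `parse_tagged` names are assumptions made for this example only; they are not the scheme of any corpus listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str  # part-of-speech tag; the tag set used below is illustrative


def parse_tagged(line: str) -> list[Token]:
    """Parse a 'word/TAG word/TAG ...' line into Token objects."""
    tokens = []
    for item in line.split():
        # Split on the last slash so that words containing '/' stay intact.
        text, _, pos = item.rpartition("/")
        tokens.append(Token(text, pos))
    return tokens


# Hypothetical annotated sentence, invented for this example.
sentence = parse_tagged("The/DET corpus/NOUN is/VERB annotated/ADJ ./PUNCT")

# The annotation layer lets us query by category rather than by word form.
nouns = [t.text for t in sentence if t.pos == "NOUN"]
```

Real corpora usually use richer formats (for example XML with several annotation layers), but the idea is the same: each unit of text is paired with labels that can be searched.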



    • LDC (Linguistic Data Consortium) and its catalogue by year.
      Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
    • European Language Resources Association and its catalogue.
      Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
    • ICAME (International Computer Archive of Modern English)
      Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the web, by sending the message help by email, or by ftp. Manuals for these corpora are also available.
    • Reuters @ NIST
      Reuters corpora are now distributed by NIST.
    • TELRI Research Archive of Computational Tools and Resources
      Corpora, many multilingual, in European Community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
    • CLR (Consortium for Lexical Research)
      Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, most of their materials are available by anonymous ftp. Their catalog is available as a PostScript file.
    • OTA (Oxford Text Archive)
      Provides mainly literary texts. Has a bright new web site. Most materials are available on the web or by anonymous ftp; some require negotiation with the providers.
    • Leipzig Corpora Collection
      Sentence collections, stored in a MySQL database, for 17 mainly European languages.
    • BNC (British National Corpus)
      A 100-million-word corpus of British English. It can be searched online through its own simple web interface or via View, a much better interface by Mark Davies. There is also an index to genres by David Lee, and an XML edition is now available.
    • European Corpus Initiative Multilingual Corpus I (ECI/MCI)
      A 98-million-word corpus covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Requires signing a license agreement, available at the WWW site. Also available from the LDC.
    • Survey of English Usage
      At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project, now available tagged and parsed for function (83,419 sentences). Includes ICECUP, dedicated retrieval software. Also offers the Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
    • International Corpus of English (ICE)
      Million-word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
    • Corpora held by Lancaster University
      The linked page provides its own annotations.
    • The European Language Activity Network
      Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
    • TalkBank
      Rich video and transcripts.