Comparable Corpus

A Comparable Corpus is a collection of text corpora that share the same sampling frame, topic, or representation but are not translations of one another.

AKA: Comparable Corpora, Comparable Text.
- …
Example(s):
- Wikipedia Comparable Corpora.
- Comparable Corpora based on Brown corpus of Standard American English, LOB Corpus and Kolhapur Corpus.
Counter-Example(s):
- A Monolingual Corpus.
- A Multilingual Corpus.
- An Aligned Parallel Corpus.
- A Translation Corpus.
- A Parallel Corpus.
See: Text Corpus, Foreign Language Writing Aid, Machine Translation, Annotation, Part-of-Speech Tagging, Lemma (Morphology), Interlinear Gloss, Parsing, Treebank, Morphology (Linguistics), Semantics, Pragmatics, Corpus Linguistics.

References

2017a

(W3-Corpora, 2017) ⇒ W3-Corpora Project: Comparable Corpora http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/types/comparable.html Retrieved 2017-05-28
- A Comparable Corpus is a collection of "similar" texts in different languages or in different varieties of a language.

The criteria for similarity between texts are not clearly defined, but the aim of this type of corpus is to compare the languages or varieties presented in similar circumstances of communication, without the distortions that appear in the translated texts of Parallel Corpora.

Examples of comparable corpora are those mirrored on the Brown corpus of Standard American English, for example, the LOB Corpus (British English), and the Kolhapur Corpus (Indian English).

Within the ICE Project (International Corpus of English), twelve centres around the world are preparing corpora of their own national or regional variety of English. The first of these (ICE-GB) will be available from spring 1998.

2017b

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Text_corpus#Overview Retrieved:2017-5-28.
- (...) In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first language corpus and a second language corpus which is an element-for-element translation of the first language corpus.

In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. Corpora can be considered as a type of foreign language writing aid as the contextualised grammatical knowledge acquired by non-native language users through exposure to authentic texts in corpora allows learners to grasp the manner of sentence formation in the target language, enabling effective writing.^[1]

↑ Yoon, H., & Hirvela, A. (2004). ESL Student Attitudes toward Corpus Use in L2 Writing. Journal Of Second Language Writing, 13(4), 257–283. Retrieved 21 March 2012.

2017c

(Linguatools, 2017) ⇒ "Wikipedia Comparable Corpora" http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/ Retrieved 2017-05-28
- The Wikipedia Comparable Corpora are bilingual document-aligned text corpora. They have been extracted from the Wikipedia Monolingual Corpora’s XML files using the crosslanguage links. Each comparable corpus consists of document pairs: Wikipedia articles in language L1 and the linked article in language L2 on the same subject. Altogether, there are over 41 million aligned articles for 253 language pairs. The 253 corpus files occupy 405 GB of disk space when unzipped.
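
The document-pairing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Linguatools extraction code: the articles are assumed to be already loaded into dictionaries keyed by title, and the cross-language links are assumed to be available as a title-to-title mapping. The function name `pair_documents` and all sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_documents(
    l1_articles: Dict[str, str],
    l2_articles: Dict[str, str],
    links: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Build (L1 text, L2 text) pairs from cross-language links.

    A link is used only when both of its endpoints exist in the
    respective article collections; dangling links are skipped.
    """
    pairs = []
    for l1_title, l2_title in links.items():
        if l1_title in l1_articles and l2_title in l2_articles:
            pairs.append((l1_articles[l1_title], l2_articles[l2_title]))
    return pairs


# Hypothetical toy data: one valid link and one dangling link.
english = {"Berlin": "Berlin is the capital of Germany.", "Orphan": "No counterpart."}
german = {"Berlin (de)": "Berlin ist die Hauptstadt Deutschlands."}
cross_links = {"Berlin": "Berlin (de)", "Orphan": "Missing"}

document_pairs = pair_documents(english, german, cross_links)
```

The resulting pairs are aligned at the document level only: the two texts cover the same subject but are not sentence-by-sentence translations.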

2014

(SMT, 2014) ⇒ SMT Research Survey Wiki: Comparable Corpora http://www.statmt.org/survey/Topic/ComparableCorpora
- QUOTE: A comparable corpus is a pair of corpora in two different languages, which come from the same domain.

(...) Parallel sentences may also be mined from comparable corpora such as news stories written on the same topic in different languages. Munteanu and Marcu (2002)^[1] use suffix trees, and in later work log-likelihood ratios (Munteanu et al., 2004; Munteanu and Marcu, 2005), to detect parallel sentences.

Abdul-Rauf and Schwenk (2009); Rauf and Schwenk (2009); Rauf and Schwenk (2011) translate one side of the comparable corpus into the other language, use information retrieval methods to find matching sentences and use the TER metric to measure their similarity. Ştefănescu et al. (2012) report improvements with a more complex sentence similarity measure.
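
The translate, retrieve, and score pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation. It assumes the source side has already been machine-translated, uses plain word overlap in place of a real information retrieval engine, and substitutes a word-level edit rate without shifts for the TER metric. The names `word_edit_distance`, `edit_rate`, `mine_pairs`, and the `max_rate` threshold are hypothetical.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_rate(hyp: List[str], ref: List[str]) -> float:
    """Edits divided by reference length: a simplified stand-in for TER (no shifts)."""
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


def mine_pairs(
    translated_src: List[str], tgt: List[str], max_rate: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Return (source index, target index, edit rate) for likely parallel sentences."""
    tgt_tokens = [t.lower().split() for t in tgt]
    results = []
    for i, sentence in enumerate(translated_src):
        src_tokens = sentence.lower().split()
        # Retrieval step (assumption: crude word overlap instead of an IR engine).
        candidates = [j for j, t in enumerate(tgt_tokens) if set(src_tokens) & set(t)]
        if not candidates:
            continue
        # Scoring step: keep the closest candidate if it is similar enough.
        best = min(candidates, key=lambda j: edit_rate(src_tokens, tgt_tokens[j]))
        rate = edit_rate(src_tokens, tgt_tokens[best])
        if rate <= max_rate:
            results.append((i, best, rate))
    return results
```

A lower `max_rate` keeps only near-identical pairs, while a higher one admits looser matches at the cost of more noise.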

Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora (Munteanu and Marcu, 2006). Methods have been proposed to extract matching phrases (Tanaka, 2002) or web pages (Smith, 2002) from such large collections. Quirk et al. (2007) propose a generative model for the same task.

Hewavitharana and Vogel (2011) extract phrase pairs from comparable corpora, using a classifier approach.

2011

(Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Comparable Corpus.” In: (Sammut & Webb, 2011) p.194
- A comparable corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that the documents in each subset are on the same topic as the documents in the others.

2010

(Skadiņa et al., 2010) ⇒ Inguna Skadiņa, Andrejs Vasiļjevs, Raivis Skadiņš, Robert Gaizauskas, Dan Tufiş, and Tatiana Gornostay. (2010). “Analysis and Evaluation of Comparable Corpora for under Resourced Areas of Machine Translation.” In: Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities.
- QUOTE: … in the context of machine translation to describe the pairing of text in one... alignment strategies designed for parallel corpora, comparable corpora, and non-comparable corpora, we showed... and Moses) are not well suited for use directly on strongly and weakly comparable texts.

2007

(McEnery & Xiao, 2007) ⇒ McEnery, A., & Xiao, R. (2007). Parallel and comparable corpora: What is happening. Incorporating Corpora. The Linguist and the Translator, 18-31. http://core.ac.uk/download/pdf/71933.pdf
- (...) In contrast, a comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (cf. McEnery, 2003: 450), e.g. the same proportions of the texts of the same genres in the same domains in a range of different languages in the same sampling period. However, the subcorpora of a comparable corpus are not translations of each other. Rather, their comparability lies in their same sampling frame and similar balance.
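
One rough way to check the "same proportions of the texts of the same genres" criterion is to compare the genre distributions of two subcorpora. The sketch below is an assumption-laden illustration rather than a method from McEnery & Xiao: it presumes each text already carries a genre label, and the function names `genre_proportions` and `max_proportion_gap` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def genre_proportions(genres: List[str]) -> Dict[str, float]:
    """Share of texts per genre label in one subcorpus."""
    counts = Counter(genres)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genre: count / total for genre, count in counts.items()}


def max_proportion_gap(a: List[str], b: List[str]) -> float:
    """Largest absolute difference in genre share between two subcorpora.

    A value near 0 suggests a similar balance; genres missing from one
    subcorpus count as a share of 0 there.
    """
    pa, pb = genre_proportions(a), genre_proportions(b)
    return max(
        (abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in set(pa) | set(pb)),
        default=0.0,
    )


# Hypothetical genre labels for two subcorpora in different languages.
subcorpus_a = ["news", "news", "fiction", "fiction"]
subcorpus_b = ["news", "news", "news", "fiction"]
gap = max_proportion_gap(subcorpus_a, subcorpus_b)
```

This checks balance only. Domain and sampling period, which the definition also mentions, would need their own labels and comparisons.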

↑ Munteanu, Dragos Stefan and Marcu, Daniel (2002): Processing Comparable Corpora With Bilingual Suffix Trees, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) DOI:10.3115/1118693.1118730
