Cross-Lingual Text Mining

Cross-Lingual Text Mining (CLTM) is a category of text mining tasks for retrieving and accessing information from document collections written in several languages.



These tasks can in principle be performed using methods that do not involve any text mining, but in practice all of them have been successfully approached through the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they all require a reliable measure of the similarity between two text spans written in different languages. There are essentially two families of approaches for doing this:
1. In translation-based approaches, one of the two text spans is first translated into the language of the other. Similarity is then computed using any measure used in the monolingual case. As a variant, both text spans can be translated into a third, pivot language.
2. In latent semantic approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in this latent semantic space, where any similarity measure for vector spaces can be used.
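
The translation-based family can be illustrated with a minimal sketch. It assumes a hypothetical toy Italian-to-English lexicon (`TOY_IT_EN`) in place of a real translation system, and uses cosine similarity over bag-of-words counts as one possible monolingual measure. The function names and example sentences are illustrative assumptions, not part of any particular CLTM system.

```python
import math
from collections import Counter

# Hypothetical toy Italian-to-English lexicon (an assumption for illustration);
# a real system would rely on machine translation or a lexicon learned
# from a parallel corpus.
TOY_IT_EN = {
    "il": "the",
    "gatto": "cat",
    "dorme": "sleeps",
    "sul": "on",
    "divano": "sofa",
    "cane": "dog",
    "mangia": "eats",
}


def translate(tokens, lexicon):
    """Word-by-word translation; unknown words are kept unchanged."""
    return [lexicon.get(tok, tok) for tok in tokens]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_lingual_similarity(source_text, target_text, lexicon):
    """Translate source_text into the target language, then compare
    the two spans with a monolingual similarity measure."""
    source_tokens = translate(source_text.lower().split(), lexicon)
    target_tokens = target_text.lower().split()
    return cosine(Counter(source_tokens), Counter(target_tokens))


italian = "il gatto dorme sul divano"
related = "the cat sleeps on the sofa"
unrelated = "stock prices fell sharply today"

sim_related = cross_lingual_similarity(italian, related, TOY_IT_EN)
sim_unrelated = cross_lingual_similarity(italian, unrelated, TOY_IT_EN)
print(sim_related, sim_unrelated)
```

In this sketch the related English sentence scores much higher than the unrelated one. A latent semantic approach would replace the `translate` step with a projection of both spans into a shared vector space estimated from a parallel corpus, keeping the final vector-space similarity step.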