1997 AutomaticCrossLanguageRetrieval

Subject Headings: Latent Semantic Indexing Algorithm; Automatic Indexing; Information Retrieval; Latent Semantic Analysis

Notes

Cited By

Quotes

Abstract

We describe a method for fully automated cross-language document retrieval in which no query translation is required. Queries in one language can retrieve documents in other languages (as well as the original language). This is accomplished by a method that automatically constructs a multilingual semantic space using Latent Semantic Indexing (LSI). Strong test results for the cross-language LSI (CL-LSI) method are presented for a new French-English collection. We also provide evidence that this automatic method performs comparably to a retrieval method based on machine translation (MT-LSI) and explore several practical training methods. By all available measures, CL-LSI performs quite well and is widely applicable.

Author's Notes:

Appears in: AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval. March 24-26, 1997, Stanford University, pp. 18-24.

Introduction

Cross-language LSI (CL-LSI) is a fully automatic method for cross-language document retrieval in which no query translation is required. Queries in one language can retrieve documents in other languages (as well as the original language). This is accomplished by a method that automatically constructs a multi-lingual semantic space using Latent Semantic Indexing (LSI).

For the CL-LSI method to be used, an initial sample of documents is translated by humans or, perhaps, by machine. From these translations, we produce a set of dual-language documents (i.e., documents consisting of parallel text from both languages) that are used to “train” the system. An LSI analysis of these training documents results in a dual-language semantic space in which terms from both languages are represented. Standard mono-lingual documents are then “folded in” to this space on the basis of their constituent terms. Queries in either language can retrieve documents in either language without the need to translate the query because all documents are represented as language-independent numerical vectors in the same LSI space.
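
As a rough illustration (not from the paper), the Python sketch below assembles dual-language training documents by concatenating parallel English and French texts, then builds the shared term-document matrix over which the LSI analysis would run; the toy corpus and whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter

# Toy parallel corpus: each English text is paired with its French translation.
english_docs = ["the cat sat on the mat", "dogs chase cats"]
french_docs  = ["le chat etait assis sur le tapis", "les chiens chassent les chats"]

# A dual-language training document is simply the concatenation of the two
# parallel texts, so terms from both languages co-occur in the same column.
dual_docs = [e + " " + f for e, f in zip(english_docs, french_docs)]

# Shared vocabulary over both languages, and a raw term-document count matrix
# (rows = terms from either language, columns = dual-language training docs).
vocab = sorted({w for d in dual_docs for w in d.split()})
term_index = {w: i for i, w in enumerate(vocab)}
matrix = [[0] * len(dual_docs) for _ in vocab]
for j, doc in enumerate(dual_docs):
    for w, c in Counter(doc.split()).items():
        matrix[term_index[w]][j] = c
```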

We compare the CL-LSI method to a related method in which the initial training of the semantic space is performed using documents in one language only. To perform retrieval in this single-language semantic space, queries and documents in other languages are first translated to the language used in the semantic space using machine translation (MT) tools. We also examine several practical training issues.

Overview of Latent Semantic Indexing (LSI)

Most information retrieval methods depend on exact matches between words in users' queries and words in documents. Such methods will, however, fail to retrieve relevant materials that do not share words with users' queries. One reason for this is that the standard retrieval models (e.g., Boolean, standard vector, probabilistic) treat words as if they are independent, although it is quite obvious that they are not. A central theme of LSI is that term-term inter-relationships can be automatically modeled and used to improve retrieval; this is critical in cross-language retrieval since direct term matching is of little use.

LSI examines the similarity of the “contexts” in which words appear, and creates a reduced-dimension feature space in which words that occur in similar contexts are near each other. LSI uses a method from linear algebra, singular value decomposition (SVD), to discover the important associative relationships. It is not necessary to use any external dictionaries, thesauri, or knowledge bases to determine these word associations because they are derived from a numerical analysis of existing texts. The learned associations are specific to the domain of interest, and are derived completely automatically.

The singular-value decomposition (SVD) technique is closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). For information retrieval and filtering applications we begin with a large term-document matrix, in much the same way as vector or Boolean methods do (Salton and McGill, 1983). This term-document matrix is decomposed into a set of k, typically 200-300, orthogonal factors from which the original matrix can be approximated by linear combination. This analysis reveals the “latent” structure in the matrix that is obscured by variability in word usage.
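
To make the decomposition step concrete, the sketch below applies a truncated SVD to a small toy term-document matrix using numpy; the choice of k = 2 and the use of raw term counts are illustrative assumptions standing in for the 200-300 factors and the weighting schemes used on real collections.

```python
import numpy as np

X = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])          # toy 4-term x 3-document count matrix

k = 2                                 # real collections typically use 200-300
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

X_k = U_k @ np.diag(s_k) @ Vt_k       # best rank-k approximation of X

term_vectors = U_k * s_k              # one k-dimensional vector per term
doc_vectors = Vt_k.T * s_k            # one k-dimensional vector per document
```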

Figure 1 illustrates the effect of LSI on term representations using a geometric interpretation. Traditional vector methods represent documents as linear combinations of orthogonal terms, as shown in the left half of the figure: Doc 3 contains term 2, Doc 1 contains term 1, and Doc 2 contains both, but the terms are uncorrelated. In contrast, LSI represents terms as continuous values on each of the orthogonal indexing dimensions, so terms are no longer independent, as depicted in the right half of Figure 1. When two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. Deerwester et al. (1990) and Furnas et al. (1988) present detailed mathematical descriptions and examples of the underlying LSI/SVD method.

Figure 1. Term representations in the standard vector vs. reduced LSI vector models.

The result of the SVD is a set of vectors representing the location of each term and document in the reduced k-dimensional LSI representation. Retrieval proceeds by using the terms in a query to identify a point in the space. Technically, the query is located at the weighted vector sum of its constituent terms. Documents are then ranked by their similarity to the query, typically using a cosine measure of similarity. While the most common retrieval scenario involves returning documents in response to a user query, the LSI representation allows for much more flexible retrieval scenarios. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained: one can, for example, ask to see a term's nearest documents, a term's nearest terms, a document's nearest terms, or a document's nearest documents. We have found all of these combinations to be useful at one time or another.
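
The following sketch illustrates this retrieval step under the assumption that term and document vectors already share a reduced space; the toy vectors, vocabulary, and unweighted query sum are assumptions made for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy 2-dimensional LSI space: three terms (two English, one French) and two documents.
term_vectors = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
doc_vectors = np.array([[0.8, 0.2], [0.1, 0.9]])
term_index = {"cat": 0, "chien": 1, "mat": 2}

# The query sits at the vector sum of its constituent terms, so it can mix
# languages freely; documents are then ranked by cosine similarity to it.
query = ["cat", "mat"]
q = np.sum([term_vectors[term_index[t]] for t in query if t in term_index], axis=0)
ranking = sorted(range(len(doc_vectors)), key=lambda i: -cosine(q, doc_vectors[i]))
print(ranking)   # document indices, most similar first
```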

New documents (or terms) can be added to the LSI representation using a procedure we call “folding in”. This method assumes that the LSI space is a reasonable characterization of the important underlying dimensions of similarity, and that new items can be described in terms of the existing dimensions. A document is located at the weighted vector sum of its constituent terms. A new term is located at the vector sum of the documents in which it occurs.
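
A minimal sketch of folding in is shown below, assuming the factors U_k, s_k, and Vt_k from a prior truncated SVD; the scaling by the singular values is a common convention for this projection rather than a detail given in the text.

```python
import numpy as np

def fold_in_document(term_counts, U_k, s_k):
    # A new document sits at the weighted vector sum of its constituent
    # terms, projected into the existing k-dimensional LSI space.
    return (term_counts @ U_k) / s_k

def fold_in_term(doc_occurrences, Vt_k, s_k):
    # A new term sits at the vector sum of the documents it occurs in.
    return (doc_occurrences @ Vt_k.T) / s_k

# Example with a toy 4-term x 3-document decomposition.
X = np.array([[2., 0., 1.], [0., 3., 1.], [1., 1., 0.], [0., 2., 2.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :2], s[:2], Vt[:2, :]

new_doc = np.array([1., 0., 2., 0.])         # counts over the four known terms
print(fold_in_document(new_doc, U_k, s_k))   # its coordinates in the LSI space
```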

In single-language document retrieval, the LSI method has equaled or outperformed standard vector methods in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990; Dumais, 1995).

Cross-Language Retrieval Using LSI

References

Susan T. Dumais, Thomas K. Landauer, Michael L. Littman, and Todd A. Letsche. (1997). "Automatic Cross-language Retrieval Using Latent Semantic Indexing." In: AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval.