2010 TheSemanticVectorsPackageNewAlg


Subject Headings: SemanticVectors, Distributional Text-Item Vectors.

Notes

Cited By

Quotes

Abstract

Distributional semantics is the branch of natural language processing that attempts to model the meanings of words, phrases and documents from the distribution and usage of words in a corpus of text. In the past three years, research in this area has been accelerated by the availability of the Semantic Vectors package, a stable, fast, scalable, and free software package for creating and exploring concepts in distributional models. This paper introduces the broad field of distributional semantics, the role of vector models within this field, and describes some of the results that have been made possible by the Semantic Vectors package. These applications of Semantic Vectors have so far included contributions to medical informatics and knowledge discovery, analysis of scientific articles, and even Biblical scholarship. Of particular interest is the recent emergence of models that take word order and other ordered structures into account, using permutation of coordinates to model directional relationships and semantic predicates.

I. Distributional Semantics and Vector Models

Distributional semantics is an empirical field of research and development that attempts to discover and model the meanings of words by analyzing and comparing their distributions in large text corpora. This approach to meaning can be traced at least to the philosopher Wittgenstein:

For a large class of cases — though not for all — in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language. [1, §43]

and the linguist Firth:

You shall know a word by the company it keeps. [2]

In a sense, this principle has been implemented for centuries in traditional concordances (see [3, Ch 1]), and its application to information in electronic form began during the first few decades of research in information retrieval [4], [5]. As with some of the fundamental models for search engines, some of these approaches are consciously probabilistic (representing word distributions as probability distributions), and others are consciously geometric (representing word distributions as vectors in high-dimensional spaces). In recent years, insights from quantum mechanics have suggested that these two frameworks can be readily derived from common principles [6], but this unification is beyond the scope of this paper, which describes developments in geometric models without attempting to compare them with probabilistic alternatives.

The simplest distributional vector model is the term-document matrix used by vector model search engines [5], [7, Ch 5]. A term-document matrix is a table that keeps count of how many times each term in a corpus appears in each document. As with any matrix, the rows or columns can be thought of as vectors in some vector space, so in a term-document matrix whose rows represent terms and whose columns represent documents, each row can be thought of as a term vector whose dimension (number of coordinates) is equal to the number of documents in the collection.
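To make this construction concrete, the following sketch (illustrative only, using a toy three-document corpus; this is not code from the paper or the Semantic Vectors package) builds a small term-document count matrix in Python, so that each row is a term vector whose dimension equals the number of documents.

```python
# Minimal sketch of a term-document matrix: rows are terms, columns are
# documents, and each entry counts occurrences of the term in the document.
from collections import Counter

docs = [
    "the vineyard produced wine and grapes",
    "abraham and sarah travelled to mamre",
    "wine and alcohol duties were debated",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
term_index = {t: i for i, t in enumerate(vocab)}

# term_doc[i][j] = number of times term i occurs in document j
term_doc = [[0] * len(docs) for _ in vocab]
for j, doc in enumerate(tokenized):
    for term, count in Counter(doc).items():
        term_doc[term_index[term]][j] = count

# Each row is a term vector; its dimension equals the number of documents.
print(term_doc[term_index["wine"]])  # [1, 0, 1]
```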

Not surprisingly, such matrices tend to be very sparse, which invites compression. Sparse representations are used in practice for optimized computation and storage, but computational optimization is not the only motivation. Semantically it is clearly incorrect to assume that each document (or each term) gives a new dimension orthogonal to all others, and in practice, being aware of redundancies here (e.g., cases in which one term can be defined using a combination of other terms) is a core part of human linguistic knowledge. For this reason, matrix factorization and other compression algorithms have been used, beginning with the application of singular value decomposition (SVD) to term-document matrices, in a technique known since the early 1990s as latent semantic indexing or latent semantic analysis [8], [9]. Singular value decomposition produces a reduced set of orthogonal axes for representing term vectors, where the number of these axes (often between 100 and 500) is much smaller than the original number of dimensions (often equal to the total number of documents in the corpus). It has often been demonstrated that words whose vectors are mapped close together by this decomposition are semantically related and sometimes even synonymous. Singular value decomposition is, however, a computationally costly process, normally of time complexity at least [math]\displaystyle{ O(qp^2) }[/math] for a matrix of size [math]\displaystyle{ (q, p) }[/math] [10, §31]. In addition, the process is memory intensive, largely because the full matrix (before compression) has to be stored in memory.
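As a rough illustration of the decomposition step, the sketch below uses truncated SVD to obtain reduced term vectors (a minimal sketch assuming numpy and a toy random count matrix; the dimensions and the variable names are arbitrary, and this is not the implementation used in latent semantic analysis systems or the Semantic Vectors package).

```python
# Sketch of latent-semantic-analysis-style reduction via truncated SVD,
# assuming a dense term-document count matrix X (terms x documents).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)  # toy counts: 200 terms, 50 docs

k = 10  # retained latent dimensions (typically 100-500 for real corpora)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Reduced term vectors: each of the 200 terms becomes a k-dimensional vector.
term_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors in the reduced space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(term_vectors.shape, cosine(term_vectors[0], term_vectors[1]))
```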

Random Projection is a cheaper alternative to singular value decomposition which provides many of the same semantic benefits [11]. Instead of spending large computational resources guaranteeing that the basis for the reduced space consists of orthogonal vectors, Random Projection rests on the observation that, in high dimensions, any pair of vectors chosen at random will be nearly orthogonal. In practice, an easy way of generating such pseudo-orthogonal vectors is to allocate a vector of zeros and randomly change a small number of coordinates to +1 and the same small number of coordinates to -1. For example, consider the following random vectors: [math]\displaystyle{ A = (0, 0, 1, 0, 1, 0, -1, \ldots, 0, 0) }[/math], [math]\displaystyle{ B = (0, 1, 0, 0, -1, 0, 1, \ldots, 0, -1) }[/math] and the scalar product [math]\displaystyle{ A \cdot B = \sum_i A_i B_i }[/math]. It is easy to see that most of the individual products [math]\displaystyle{ A_i B_i }[/math] are zero, and that the nonzero products are evenly distributed between +1 and -1 contributions, leaving an expected total near to zero. An important result that follows from this is the Johnson–Lindenstrauss Lemma [12], which guarantees that projection from a high-dimensional vector space onto a smaller subspace is likely to preserve the ordering of scalar products. Thus, Random Projection can be used to reduce the number of dimensions of a high-dimensional space of term vectors while preserving most of the relationships between those vectors. Note that we should not expect random projection to improve these relationships, for example by mapping synonyms closer together, but this challenge can be addressed using Reflective Random Indexing, as discussed in Section IV-C.
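The following sketch (again assuming numpy, with arbitrary toy dimensions; not the package's code) shows how such sparse ternary random vectors can be generated, why their scalar products come out close to zero, and how a term-document count matrix can be projected down to a fixed number of coordinates.

```python
# Sketch of Random Projection with sparse ternary random vectors
# (a few +1s and -1s, zeros elsewhere), as described above.
import numpy as np

rng = np.random.default_rng(42)

def random_index_vector(dim=500, seeds=10):
    """Pseudo-orthogonal vector: `seeds` coordinates set to +1 and `seeds` to -1."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=2 * seeds, replace=False)
    v[idx[:seeds]] = 1.0
    v[idx[seeds:]] = -1.0
    return v

A, B = random_index_vector(), random_index_vector()
print(A @ B)  # scalar product is near zero: the vectors are nearly orthogonal

# Projecting a (terms x documents) count matrix X down to `dim` coordinates:
# each document gets a random vector, and each reduced term vector is the
# count-weighted sum of the random vectors of the documents it appears in.
X = rng.integers(0, 3, size=(200, 1000)).astype(float)             # toy counts
R = np.stack([random_index_vector() for _ in range(X.shape[1])])   # docs x dim
reduced_terms = X @ R                                               # terms x dim
print(reduced_terms.shape)
```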

A key observation is that, rather than being roughly cubic, Random Projection from a sparsely represented term-document matrix can be made to perform more-or-less linearly [13]. This is a key advantage. A couple of decades ago, we might perhaps have argued that “Moore’s law will take care of it.” With hindsight, the opposite has happened — the size of linguistic corpora has grown at least as fast as the resources available to any one machine, so algorithms whose complexity is more than linear in the size of the corpus are becoming gradually less tractable rather than more. This naturally leads to increased interest in algorithms that can take advantage of huge datasets, even if they are less sensitive on small datasets.
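One way to see why the cost can be kept roughly linear is that the term-document matrix never needs to be materialized: in an incremental, random-indexing-style pass, each document contributes its random vector to the vectors of the terms it contains, so the total work is proportional to the number of token occurrences in the corpus. The sketch below illustrates this idea with assumed helper names (index_document and so on); it is an illustration of the general technique, not the Semantic Vectors implementation.

```python
# Sketch of a single-pass, incremental projection: documents are streamed one
# at a time, and each term vector accumulates the random vectors of the
# documents it occurs in, giving work roughly linear in corpus size.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, SEEDS = 500, 10

def random_index_vector():
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=2 * SEEDS, replace=False)
    v[idx[:SEEDS]] = 1.0
    v[idx[SEEDS:]] = -1.0
    return v

term_vectors = defaultdict(lambda: np.zeros(DIM))

def index_document(tokens):
    doc_vector = random_index_vector()        # one random vector per document
    for term in tokens:
        term_vectors[term] += doc_vector      # constant work per token occurrence

# Memory holds only the term vectors, never the full term-document matrix.
index_document("the vineyard produced wine and grapes".split())
index_document("wine and alcohol duties were debated".split())
```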

Examples of the time taken for matrix compression using Random Projection and SVD on different corpora are shown in Table I. Random Projection always outperforms SVD in these experiments, and the difference becomes more pronounced with larger corpora. Note that the numbers are not completely representative of the algorithms in question, because they also include many basic I/O operations. The N/A (not applicable) entry for SVD on 5,000,000 Medline documents is because in this case the workstation ran out of memory (the maximum heap allocation was set to 8GB).

The best way to get an intuitive feel for the power and potential of these distributional semantic methods is to consider some examples, as shown in Table II. The ‘RP’ column shows results from Random Projection, and the ‘SVD’ column shows results from Singular Value Decomposition. These results are reasonably typical: from manual inspection, we have not yet found queries for which the two algorithms give results of significantly different quality (though if such results are found, we will endeavor to make them publicly known).

TABLE I
TIME TAKEN FOR THE SEMANTIC VECTORS PACKAGE TO INDEX DIFFERENT DOCUMENT COLLECTIONS USING RANDOM PROJECTION AND SINGULAR VALUE DECOMPOSITION
Corpus Num Docs RP Time SVD Time
King James Bible 1256 2.8s 5.8s
Europarl English 4047 13s 37s
TASA 44,487 21.27s 2m 27s
Medline 1,000,000 1m 17s 6m 19s
Medline 5,000,000 2m 15s N/A
TABLE II
NEAREST NEIGHBOR TERMS WITH THEIR COSINE SIMILARITIES IN VARIOUS MODELS
King James Bible. Query “abraham”.
RP SVD
1.00 abraham 1.00 abraham
0.64 sarah 0.81 isaac
0.54 isaac 0.68 sarah
0.47 bethuel 0.62 rebekah
0.44 mamre 0.61 phicol
0.42 jidlaph 0.57 bethuel
0.42 jehovahjireh 0.56 herdmen
0.42 thahash 0.55 nahor
0.42 tebah 0.54 ahuzzath
0.42 pildash 0.54 esek

English Europarl Corpus. Query “wine”.
RP SVD
1.00 wine 1.00 wine
0.78 wines 0.76 wines
0.77 vineyards 0.63 grapes
0.76 musts 0.54 musts
0.75 vineyard 0.48 winegrowers
0.73 bottling 0.47 litre
0.69 grape 0.46 alcohol
0.68 distillation 0.43 alcoholic
0.67 saccharose 0.42 klerk
0.66 liqueur 0.42 distillation

From results such as these, it can be seen that the distributional hypothesis can produce clearly semantically significant results, as has been shown in many research experiments (see [7], [14] for many references). However, distributional models have yet to realize their full potential as part of the industrial mainstream. We believe that part of the problem has been with software performance and reliability: hence much of the research in this paper focuses on this area.

II. THE SEMANTIC VECTORS PACKAGE

The Semantic Vectors package, originally created for the University of Pittsburgh, was released as an open source package in October 2007 [15]. The package can be freely downloaded from http://code.google.com/p/semanticvectors. It is released under a liberal BSD license that permits commercial and non-commercial use and modification. The goal of the project is to provide a scalable and stable platform for building commercial-scale distributional semantics applications, and for researchers and developers to use as a platform for implementing new algorithms. Since the release of the package there have been over 6000 downloads, the development team has slowly grown, and an increasing variety of algorithms and features have been added. To date, these include:

Vital to the growth of the project, these developments have not come from a single source, but from several contributors from several institutions. The package has supported several experimental analyses that have led to publications and successful dissertation and thesis projects. It has also provided a framework within which research engineers can relatively easily create implementations of recently published research, and so build rapidly upon the state of the art.

This report summarizes the design and development of the Semantic Vectors project, and presents some of the research that has been performed using the package. Most of the features described in this report are released as part of the standard Semantic Vectors package and can readily be demonstrated in action. This article also includes references to other projects that have successfully used the package.

References


Dominic Widdows, and Trevor Cohen. (2010). “The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics.” doi:10.1109/ICSC.2010.94