2002 ASurveyOfDimeRedTech


Subject Headings: Dimensionality Reduction Task, Dimensionality Reduction Algorithm, Survey Paper, Principal Component Analysis, Factor Analysis, Projection Pursuit, Independent Component Analysis, Non-Linear Principal Component Analysis, Random Projections, Non-Linear Independent Component Analysis, Principal Curves, Multidimensional Scaling, Topological Continuous Map.

Notes

Quotes

Introduction

... Traditional statistical methods break down partly because of the increase in the number of observations, but mostly because of the increase in the number of variables associated with each observation. The dimension of the data is the number of variables that are measured on each observation.

High-dimensional datasets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments [11]. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are "important" for understanding the underlying phenomena of interest. While certain computationally expensive novel methods [4] can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling of the data.

In mathematical terms, the problem we investigate can be stated as follows: given the p-dimensional random variable [math]\displaystyle{ x = (x_1, \ldots, x_p)^T }[/math], find a lower dimensional representation of it, [math]\displaystyle{ s = (s_1, \ldots, s_k)^T }[/math] with [math]\displaystyle{ k \le p }[/math], that captures the content in the original data, according to some criterion. The components of [math]\displaystyle{ s }[/math] are sometimes called the hidden components. Different fields use different names for the [math]\displaystyle{ p }[/math] multivariate vectors: the term "variable" is mostly used in statistics, while "feature" and "attribute" are alternatives commonly used in the computer science and machine learning literature.

Throughout this paper, we assume that we have [math]\displaystyle{ n }[/math] observations, each being a realization of the p-dimensional random variable [math]\displaystyle{ x = (x_1, \ldots, x_p)^T }[/math] with mean [math]\displaystyle{ E(x) = \mu = (\mu_1, \ldots, \mu_p)^T }[/math] and covariance matrix [math]\displaystyle{ E\{(x - \mu)(x - \mu)^T\} = \Sigma_{p \times p} }[/math]. We denote such an observation matrix by [math]\displaystyle{ X = \{x_{i,j} : 1 \le i \le p, 1 \le j \le n\} }[/math]. If [math]\displaystyle{ \mu_i }[/math] and [math]\displaystyle{ \sigma_i = \sqrt{\Sigma_{ii}} }[/math] denote the mean and the standard deviation of the ith random variable, respectively, then we will often standardize the observations [math]\displaystyle{ x_{i,j} }[/math] by [math]\displaystyle{ (x_{i,j} - \mu_i)/\sigma_i }[/math], where …
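
For concreteness, here is a minimal NumPy sketch of this standardization step (the function name and data layout are ours, not the paper's; X is stored with one row per variable and one column per observation, matching the convention above):

    import numpy as np

    def standardize(X):
        """Standardize X of shape (p, n): rows are variables, columns are observations."""
        mu = X.mean(axis=1, keepdims=True)      # per-variable means, shape (p, 1)
        sigma = X.std(axis=1, keepdims=True)    # per-variable standard deviations
        return (X - mu) / sigma                 # (x_{i,j} - mu_i) / sigma_i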

We distinguish two major types of dimension reduction methods: linear and non-linear. Linear techniques result in each of the [math]\displaystyle{ k \le p }[/math] components of the new variable being a linear combination of the original variables.
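
For concreteness (the weight notation here is ours, not quoted from the paper), a linear technique can be written as [math]\displaystyle{ s = Wx }[/math] for some [math]\displaystyle{ k \times p }[/math] weight matrix [math]\displaystyle{ W }[/math], so that each component is [math]\displaystyle{ s_i = w_{i,1}x_1 + \cdots + w_{i,p}x_p }[/math]; the linear methods reviewed below differ in how [math]\displaystyle{ W }[/math] is chosen.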

In this paper, we review traditional and current state-of-the-art dimension reduction methods published in the statistics, signal processing and machine learning literature. There are numerous books and articles [41, 17, 5, 14, 19, 46, 13] in the statistical literature on techniques for analyzing multivariate datasets; related advances in computer science and machine learning are covered in [43, 50, 44, 2]. Among earlier survey papers, [7] reviews several methods, including principal components analysis, projection pursuit, principal curves and self-organizing maps, and also provides neural network implementations of some of the reviewed statistical models. [22] surveys recent results in independent component analysis, in the context of other dimension reduction methods.

2 Principal component analysis

Principal component analysis (PCA) is the best, in the mean-square error sense, linear dimension reduction technique [25, 28]. Being based on the covariance matrix of the variables, it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loève transform, the Hotelling transform, and the empirical orthogonal function (EOF) method.

In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance. The first PC, s1, is the linear combination with the largest variance.
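
As an illustration (a sketch of the standard eigendecomposition route, not code from the paper), the PCs can be obtained from the eigenvectors of the sample covariance matrix associated with the largest eigenvalues:

    import numpy as np

    def pca(X, k):
        """PCA sketch: X has shape (p, n) (variables in rows), k is the target dimension."""
        Xc = X - X.mean(axis=1, keepdims=True)      # center each variable
        C = Xc @ Xc.T / (X.shape[1] - 1)            # p x p sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest variances
        W = eigvecs[:, order]                       # p x k matrix of principal directions
        return W.T @ Xc                             # k x n matrix of principal components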

3 Factor analysis

This section follows [41]. Like PCA, factor analysis (FA) is also a linear method, based on the second-order data summaries. First suggested by psychologists, FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. Typical examples include variables defined as various test scores of individuals, as such scores are thought to be related to a common "intelligence" factor. The goal of FA is to uncover such relations, and thus FA can be used to reduce the dimension of datasets following the factor model.
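
In standard notation (not quoted from this section), the k-factor model assumes [math]\displaystyle{ x - \mu = \Lambda f + u }[/math], where [math]\displaystyle{ \Lambda }[/math] is the [math]\displaystyle{ p \times k }[/math] matrix of factor loadings, [math]\displaystyle{ f }[/math] holds the unobserved common factors, and [math]\displaystyle{ u }[/math] holds the specific (error) terms. A minimal sketch of fitting such a model, using scikit-learn's FactorAnalysis (a library choice of ours, not mentioned in the paper) on placeholder data:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Illustrative only: fit a 3-factor model to data with n observations in rows
    # (the scikit-learn convention, shape (n, p)).
    X = np.random.default_rng(0).normal(size=(200, 10))   # placeholder data
    fa = FactorAnalysis(n_components=3)
    scores = fa.fit_transform(X)       # (n, 3) estimated common-factor scores
    loadings = fa.components_.T        # (10, 3) estimated factor loadings Lambda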

4 Projection pursuit

Projection pursuit (PP) is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods.

Given a projection index that defines the "interestingness" of a direction, PP looks for the directions that optimize that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, subject to the constraint that the projections be orthogonal, as the projection index, PP yields the familiar PCA.
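
As a rough illustration of the idea (not the paper's algorithm), one can take the absolute excess kurtosis of the projected data as a crude non-Gaussianity index and numerically search for a unit direction that maximizes it:

    import numpy as np
    from scipy.optimize import minimize

    def kurtosis_index(w, X):
        """Projection index: absolute excess kurtosis of the 1-D projection w^T x."""
        w = w / np.linalg.norm(w)              # keep the direction at unit length
        z = w @ X                              # X has shape (p, n); z is the projection
        z = (z - z.mean()) / z.std()
        return -abs(np.mean(z**4) - 3.0)       # negated: minimizing maximizes the index

    def projection_pursuit(X, seed=0):
        p = X.shape[0]
        w0 = np.random.default_rng(seed).normal(size=p)
        res = minimize(kurtosis_index, w0, args=(X,))
        return res.x / np.linalg.norm(res.x)   # the "most interesting" unit direction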

5 Independent component analysis

This section is based on [22], a recent survey on independent component analysis (ICA). More information (and software) on this currently very popular method can be found at various websites, including [6, 24, 49]. Books summarizing the recent advances in the theory and application of ICA include [1, 48, 15, 38].

ICA is a higher-order method that seeks linear projections, not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness. While the latter only involves the second-order statistics, the former depends on all the higher-order statistics.
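
A minimal sketch, using scikit-learn's FastICA (a library choice of ours, not part of the quoted text) to recover two independent non-Gaussian sources from a linear mixture:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Illustrative only: unmix a linear mixture of two independent non-Gaussian sources.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    sources = np.c_[np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]   # (n, 2)
    A = np.array([[1.0, 0.5], [0.7, 1.0]])                              # mixing matrix
    X = sources @ A.T                                                   # observed mixtures
    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)    # estimated independent components (up to order/scale)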

6 Non-linear principal component analysis

Non-linear PCA introduces non-linearity in the objective function, but the resulting components are still linear combinations of the original variables. This method can also be thought of as a special case of independent component analysis (Section 5.1.4). As indicated in [31], there are different formulations of non-linear PCA.

7 Random projections

The method of random projections is a simple yet powerful dimension reduction technique that uses random projection matrices to project the data into lower dimensional spaces [47, 32, 33, 35]. The original data [math]\displaystyle{ X \in \mathbb{R}^p }[/math] is transformed to the lower dimensional [math]\displaystyle{ S \in \mathbb{R}^k }[/math], with [math]\displaystyle{ k \ll p }[/math], via

[math]\displaystyle{ S = RX, \qquad (55) }[/math]

where the columns of R are realizations of independent and identically distributed (i.i.d.) zero-mean normal variables, scaled to have unit length. The method was proposed in the context of clustering text documents, where the initial dimension p can be on the order of 6000, and the final dimension k is still relatively large, on the order of 100. Under such circumstances, even PCA, the simplest alternative linear dimension reduction technique, can be computationally too expensive. Random projections are applied as a data pre-processing step; the resulting lower-dimensional data are then clustered. It has been shown empirically that results with the random projection method are comparable with results obtained with PCA, and take a fraction of the time PCA requires [33, 35].
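
A minimal NumPy sketch of equation (55) under these conventions (the function name and data layout are ours; X is stored with one column per observation):

    import numpy as np

    def random_projection(X, k, seed=0):
        """Project X (shape (p, n), one column per observation) to k dimensions via S = R X."""
        p = X.shape[0]
        rng = np.random.default_rng(seed)
        R = rng.normal(size=(k, p))                    # i.i.d. zero-mean normal entries
        R /= np.linalg.norm(R, axis=0, keepdims=True)  # scale each column of R to unit length
        return R @ X                                   # S = R X, shape (k, n)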

8.1 Non-linear independent component analysis

Non-linear methods, such as principal curves, self organizing maps and topographic maps, can, in principle, be incorporated into ICA.

8.2 Principal curves

Principal curves are smooth curves that pass through the "middle" of multidimensional data sets [18, 40, 7]. Linear principal curves are in fact principal components, while non-linear principal curves generalize the concept.
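
A very rough sketch in the spirit of the Hastie-Stuetzle algorithm (ours, with a simple moving-average smoother standing in for the scatterplot smoothers used in the literature): alternate between smoothing each coordinate along the current ordering of the points and re-projecting each point onto the updated curve.

    import numpy as np

    def principal_curve(X, n_iter=10, span=0.1):
        """Very rough principal-curve sketch: X has shape (n, p), one row per point."""
        Xc = X - X.mean(axis=0)
        # initialize the curve with the projection onto the first principal component
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        lam = Xc @ Vt[0]                        # projection index (parameter) for each point
        curve = np.outer(lam, Vt[0])            # initial curve points (on the PC line)
        h = max(3, int(span * len(X)))          # moving-average window size
        for _ in range(n_iter):
            order = np.argsort(lam)
            # smoothing step: moving average of each coordinate along the current ordering
            kernel = np.ones(h) / h
            smoothed = np.vstack([np.convolve(Xc[order, j], kernel, mode='same')
                                  for j in range(Xc.shape[1])]).T
            curve[order] = smoothed
            # projection step (crude): re-estimate each point's parameter as the
            # arc-length position of its nearest curve point
            arc = np.r_[0, np.cumsum(np.linalg.norm(np.diff(curve[order], axis=0), axis=1))]
            d = ((Xc[:, None, :] - curve[order][None, :, :]) ** 2).sum(axis=2)
            lam = arc[np.argmin(d, axis=1)]
        return curve, lam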

8.3 Multidimensional scaling

Given n items in a p-dimensional space and an [math]\displaystyle{ n \times n }[/math] matrix of proximity measures among the items, multidimensional scaling (MDS) produces a k-dimensional, [math]\displaystyle{ k \le p }[/math], representation of the items such that the distances among the points in the new space reflect the proximities in the data [8, 7, 41]. The proximity measures the (dis)similarity among the items and is, in general, a distance measure: the more similar two items are, the smaller their distance is. Popular distance measures include the Euclidean distance (L2 norm), the Manhattan distance (L1, absolute norm), and the maximum norm. Results of MDS are indeterminate with respect to translation, rotation, and reflection.

MDS has typically been used to transform the data into two or three dimensions and to visualize the result in order to uncover hidden structure in the data.
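
Classical (metric) MDS with Euclidean distances has a closed-form solution via double centering and an eigendecomposition; a NumPy sketch of that variant (ours, not code from the paper):

    import numpy as np

    def classical_mds(D, k=2):
        """Classical (metric) MDS sketch: D is an n x n matrix of pairwise distances."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                    # double-centered squared distances
        eigvals, eigvecs = np.linalg.eigh(B)           # ascending eigenvalues
        idx = np.argsort(eigvals)[::-1][:k]            # k largest eigenvalues
        L = np.sqrt(np.maximum(eigvals[idx], 0.0))     # guard against small negatives
        return eigvecs[:, idx] * L                     # n x k configuration of points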

8.4 Topologically continuous maps

There are several methods based on finding a continuous map to transform high-dimensional data into a lower-dimensional lattice (latent) space of fixed dimension [7]. Such techniques could be called self-organizing maps, but that name is most often associated with one particular such method, namely, Kohonen's self-organizing maps. To avoid confusion, we follow the review [7], and refer to these methods collectively as methods that use topologically continuous maps.
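
Kohonen's self-organizing map is the best-known instance; a minimal one-dimensional SOM sketch (ours, for illustration only, with a simple linear decay of the learning rate and neighborhood radius):

    import numpy as np

    def som_1d(X, n_units=10, n_iter=2000, lr=0.5, sigma=2.0, seed=0):
        """Minimal 1-D Kohonen SOM sketch: X has shape (n, p); returns the lattice codebook."""
        rng = np.random.default_rng(seed)
        W = X[rng.choice(len(X), n_units, replace=False)].astype(float)  # init codebook
        grid = np.arange(n_units)                       # positions on the 1-D lattice
        for t in range(n_iter):
            x = X[rng.integers(len(X))]                 # pick a random observation
            bmu = np.argmin(((W - x) ** 2).sum(axis=1)) # best-matching unit
            decay = 1.0 - t / n_iter                    # shrink learning rate and radius
            h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * decay + 1e-3) ** 2))
            W += (lr * decay) * h[:, None] * (x - W)    # pull units toward x, neighbors less
        return W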



Imola K. Fodor (2002). "A Survey of Dimension Reduction Techniques." https://computation.llnl.gov/casc/sapphire/pubs/148494.pdf