2009 ComputationalStatistics

From GM-RKB

Subject Headings: Computational Statistics, Textbook.

Notes

Cited By

Quotes

Author Keywords

bootstrap - clustering and classification - nonparametric probability density estimation - numerical methods - random number generation

Book Overview

Computational inference is based on an approach to statistical methods that uses modern computational power to simulate distributional properties of estimators and test statistics. This book describes computationally intensive statistical methods in a unified presentation, emphasizing techniques, such as the PDF decomposition, that arise in a wide range of methods.

  • Provides comprehensive coverage of modern computationally intensive statistical methods
  • Covers the topics in numerical analysis necessary for accurate and efficient implementation of the methods of computational statistics
  • Emphasizes the unity of the methods of computational inference

Computational inference has taken its place alongside asymptotic inference and exact techniques in the standard collection of statistical methods.

Table of Contents

Part I Preliminaries
  • Chapter 1. Mathematical and statistical preliminaries.
Part II Statistical Computing
  • Chapter 2. Computer storage and arithmetic.
  • Chapter 3. Algorithms and programming.
  • Chapter 4. Approximation of functions and numerical quadrature.
  • Chapter 5. Numerical linear algebra.
  • Chapter 6. Solution of nonlinear equations and optimization.
  • Chapter 7. Generation of random numbers.
Part III Methods of Computational Statistics
  • Chapter 8. Graphical methods in computational statistics.
  • Chapter 9. Tools for identification of structure in data.
  • Chapter 10. Estimation of functions.
  • Chapter 11. Monte Carlo methods for statistical inference.
  • Chapter 12. Data randomization, partitioning, and augmentation.
  • Chapter 13. Bootstrap methods.
Part IV Exploring Data Density and Relationships

Appendices

  • Monte Carlo Studies in Statistics


Preface

http://mason.gmu.edu/~jgentle/books/cmstatbk/Computational_Statistics_Preface.pdf

This book began as a revision of Elements of Computational Statistics, published by Springer in 2002. That book covered computationally-intensive statistical methods from the perspective of statistical applications, rather than from the standpoint of statistical computing.

Most of the students in my courses in computational statistics were in a program that required multiple graduate courses in numerical analysis, and so in my course in computational statistics, I rarely covered topics in numerical linear algebra or numerical optimization, for example. Over the years, however, I included more discussion of numerical analysis in my computational statistics courses. Also over the years I have taught numerical methods courses with no or very little statistical content. I have also accumulated a number of corrections and small additions to Elements of Computational Statistics. The present book includes most of the topics from Elements and also incorporates this additional material. The emphasis is still on computationally intensive statistical methods, but there is a substantial portion on the numerical methods supporting the statistical applications.

I have attempted to provide a broad coverage of the field of computational statistics. This obviously comes at the price of depth.

Part I, consisting of one rather long chapter, presents some of the most important concepts and facts over a wide range of topics in intermediate-level mathematics, probability and statistics, so that when I refer to these concepts in later parts of the book, the reader has a frame of reference.

Part I attempts to convey the attitude that computational inference, together with exact inference and asymptotic inference, is an important component of statistical methods.

Many statements in Part I are made without any supporting argument, but references and notes are given at the end of the chapter. Most readers and students in courses in statistical computing or computational statistics will be familiar with a substantial proportion of the material in Part I, but I do not recommend skipping the chapter. If readers are already familiar with the material, they should just read faster. The perspective in this chapter is that of the “big picture”. As is often apparent in oral exams, many otherwise good students lack a basic understanding of what it is all about.

A danger is that the student or the instructor using the book as a text will too quickly gloss over Chapter 1 and miss some subtle points.

Part II addresses statistical computing, a topic dear to my heart. There are many details of the computations that can be ignored by most statisticians, but no matter at what level of detail a statistician needs to be familiar with the computational topics of Part II, there are two simple, higher-level facts that all statisticians should be aware of and which I state often in this book:

Computer numbers are not the same as real numbers, and the arithmetic operations on computer numbers are not exactly the same as those of ordinary arithmetic.

and

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Regarding the first statement, some of the differences in real numbers and computer numbers are summarized in Table 2.1 on page 98.

A prime example of the second statement is the use of the normal equations in linear regression, [math]\displaystyle{ X^T X b = X^T y }[/math]. It is quite appropriate to write and discuss these equations. We might consider the elements of [math]\displaystyle{ X^T X }[/math], and we might even write the least squares estimate of β as [math]\displaystyle{ b = (X^T X)^{-1} X^T y }[/math]. That does not mean that we ever actually compute [math]\displaystyle{ X^T X }[/math] or [math]\displaystyle{ (X^T X)^{-1} }[/math], although we may compute functions of those matrices or even certain elements of them.
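
As a minimal illustration of this point (not from the book; the data and variable names are invented), the following NumPy sketch computes the least squares estimate both from the literal formula and with a library solver that never forms [math]\displaystyle{ X^T X }[/math] or its inverse:

<syntaxhighlight lang="python">
# Illustrative sketch only: least squares with and without forming X^T X.
# numpy.linalg.lstsq uses an SVD-based solver internally.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# The mathematical expression b = (X^T X)^{-1} X^T y, computed literally
# (how NOT to do it in practice) ...
b_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y

# ... and a numerically preferable computation of the same quantity.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_normal_eq, b_lstsq))  # True here; may differ for ill-conditioned X
</syntaxhighlight>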

The most important areas of statistical computing (to me) are

  • computer number systems;
  • algorithms and programming;
  • function approximation and numerical quadrature;
  • numerical linear algebra;
  • solution of nonlinear equations and optimization;
  • generation of random numbers.

These topics are the subjects of the individual chapters of Part II.

One of the Top 10 algorithms, dating to the 1940s, is the basis for MCMC methods, which began receiving attention from statisticians around 1990 and in the past twenty years have been one of the hottest areas in statistics. I could go on, but I tire of telling “the way we used to do it”. Let’s learn what we need to do it the best way now.

Data

I do not make significant use of any “real world” datasets in the book. I often use “toy” datasets because I think that is the quickest way to get the essential characteristics of a method. I sometimes refer to the datasets that are available in R or S-Plus, and in some exercises, I point to websites for various real world datasets.

Many exercises require the student to generate artificial data. While such datasets may lack any apparent intrinsic interest, I believe that they are often the best for learning how a statistical method works. One of my firm beliefs is

If I understand something, I can simulate it.

Learning to simulate data with given characteristics means that one understands those characteristics. Applying statistical methods to simulated data may lack some of the perceived satisfaction of dealing with “real data”, but it helps us better to understand those methods and the principles underlying them.
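
A minimal sketch of this idea (the distribution and the target characteristics are invented for illustration): simulate data with a specified mean, spread, and correlation, and check that the simulated sample reproduces them.

<syntaxhighlight lang="python">
# Illustrative sketch only: simulate data with given characteristics and
# verify that the sample reproduces them.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, rho, n = 10.0, 2.0, 0.7, 10_000

# Bivariate normal with the given means, standard deviations, and correlation.
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal([mu, mu], cov, size=n)

print(x.mean(axis=0))          # should be near (10, 10)
print(np.corrcoef(x.T)[0, 1])  # should be near 0.7
</syntaxhighlight>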

A Word About Notation

I try to be very consistent in notation. Most of the notation is “standard”. Appendix C contains a list of notation, but a general summary here may be useful. Terms that represent mathematical objects, such as variables, functions, and parameters, are generally printed in an italic font. The exceptions are the standard names of functions, operators, and mathematical constants, such as sin, log, Γ (the gamma function), Φ (the normal CDF), E (the expectation operator), d (the differential operator), e (the base of the natural logarithm), and so on.

I tend to use Greek letters for parameters and English letters for almost everything else, but in some cases, I am not consistent in this distinction.

I do not distinguish vectors and scalars in the notation; thus, “x” may represent either a scalar or a vector, and x_i may represent either the ith element of an array or the ith vector in a set of vectors. I use uppercase letters for matrices and the corresponding lowercase letters with subscripts for elements of the matrices. I do not use boldface except for emphasis or for headings.

I generally use uppercase letters for random variables and the corresponding lowercase letters for realizations of the random variables. Sometimes I am not completely consistent in this usage, especially in the case of random samples and statistics.

I describe a number of methods or algorithms in this book. The descriptions are in a variety of formats. Occasionally they are just in the form of text; the algorithm is described in (clear?!) English text. Often they are presented as pseudocode, either in the form of equations with simple for-loops, such as for the accumulation of a sum of corrected squares on page 116, or in pseudocode that looks more like Fortran or C. (Even if C-like statements are used, I almost always begin the indexing at the 1st element; that is, at the first element, not the zeroth element. The exceptions are for cases in which the index also represents a power, such as in a polynomial; in such cases, the 0th element is the first element. I call this “0 equals first” indexing.) Other times the algorithms are called “Algorithm x.x” and listed as a series of steps, as on page 218. There is a variation of the “Algorithm x.x” form. In one form the algorithm is given a name and its input is listed as input arguments, for example MergeSort, on page 122. This form is useful for recursive algorithms because it allows for an easy specification of the recursion. Pedagogic considerations (if not laziness!) led me to use a variety of formats for presentation of algorithms; the reader will likely encounter a variety of formats in the literature in statistics and computing, and some previous exposure should help to make the format irrelevant.

Chapter 1: Mathematical and Statistical Preliminaries

The purpose of an exploration of data may be rather limited, and it may be ad hoc, or the purpose may be more general, perhaps to gain understanding of some natural phenomenon.

The questions addressed in the data exploration may be somewhat open-ended. The process of understanding often begins with general questions about the structure of the data. At any stage of the analysis, our understanding is facilitated by means of a model.

A model is a description that embodies our current understanding of a phenomenon. In an operational sense, we can formulate a model either as a description of a data-generating process, or as a prescription for processing data.

The model is often expressed as a set of equations that relate data elements to each other. It may include probability distributions for the data elements. If any of the data elements are considered to be realizations of random variables, the model is a stochastic model.

A model should not limit our analysis; rather, the model should be able to evolve. The process of understanding involves successive refinements of the model. The refinements proceed from vague models to more specific ones. An exploratory data analysis may begin by mining the data to identify interesting properties. These properties generally raise questions that are to be explored further.

A class of models may have a common form within which the members of the class are distinguished by values of parameters. For example, the class of normal probability distributions has a single form of a probability density function that has two parameters. Within this family of probability distributions, these two parameters completely characterize the distributional properties. If this form of model is chosen to represent the properties of a dataset, we may seek confidence intervals for values of the two parameters or perform statistical tests of hypothesized values of these two parameters.

In models that are not as mathematically tractable as the normal probability model — and many realistic models are not — we may need to use computationally intensive methods involving simulations, resamplings, and multiple views to make inferences about the parameters of a model. These methods are part of the field of computational statistics.

1.1 Discovering Structure in Data

The components of statistical datasets are “observations” and “variables” or “features”. In general, “data structures” are ways of organizing data to take advantage of the relationships among the variables constituting the dataset. Data structures may express hierarchical relationships, crossed relationships (as in “relational” databases), or more complicated aspects of the data (as in “object-oriented” databases). Data structures, or more generally, database management, is a relatively mature area of computer science.

In data analysis, “structure in the data” is of interest. Structure in the data includes such nonparametric features as modes, gaps, or clusters in the data, the symmetry of the data, and other general aspects of the shape of the data. Because many classical techniques of statistical analysis rely on an assumption of normality of the data, the most interesting structure in the data may be those aspects of the data that deviate most from normality. In identifying and studying structure, we use both numerical measures and graphical views.

Multiple Analyses and Multiple Views

There are many properties that are more apparent from graphical displays of the data.

Although it may be possible to express the structure in the data in terms of mathematical models, prior to attempting to do so, graphical displays may be used to discover qualitative structure in the data. Patterns observed in the data may suggest explicit statements of the structure or of relationships among the variables in the dataset. The process of building models of relationships is an iterative one, and graphical displays are useful throughout the process. Graphs comparing data and the fitted models are used to refine the models. Effective use of graphics often requires multiple views. For multivariate data, plots of individual variables or combinations of variables can be produced quickly and used to get a general idea of the properties of the data. The data should be inspected from various perspectives. Instead of a single histogram to depict the general shape of univariate data, for example, multiple histograms with different bin widths and different bin locations may provide more insight. (See Figure 8.4 on page 347.)

Sometimes, a few data points in a display can completely obscure interesting structure in the other data points. This is the case when the Euclidean distances between various pairs of data points differ greatly. A zooming window to restrict the scope of the display and simultaneously restore the scale …

1.2 Mathematical Tools for Identifying Structure in Data

Eigenvalues and Eigenvectors

Multiplication of a given vector by a square matrix may result in a scalar multiple of the vector. If [math]\displaystyle{ A }[/math] is an [math]\displaystyle{ n \times n }[/math] matrix, [math]\displaystyle{ v }[/math] is a vector not equal to [math]\displaystyle{ 0 }[/math], and [math]\displaystyle{ c }[/math] is a scalar such that

[math]\displaystyle{ Av = cv, }[/math] (1.57)

we say [math]\displaystyle{ v }[/math] is an eigenvector of [math]\displaystyle{ A }[/math] and [math]\displaystyle{ c }[/math] is an eigenvalue of [math]\displaystyle{ A }[/math].

... The effect of a matrix multiplication of an eigenvector is the same as a scalar multiplication of the eigenvector. The eigenvector is an invariant of the transformation in the sense that its direction does not change under the matrix multiplication transformation. This would seem to indicate that the eigenvector and eigenvalue depend on some kind of deep properties of the matrix, and indeed, this is the case.

We immediately see that if an eigenvalue of a matrix [math]\displaystyle{ A }[/math] is [math]\displaystyle{ 0 }[/math], then [math]\displaystyle{ A }[/math] must be singular.
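
A small numerical check of the defining relation (1.57), with an example matrix chosen for illustration (not from the book):

<syntaxhighlight lang="python">
# Illustrative sketch only: verify Av = cv for a small symmetric matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(A)   # eigh: eigen decomposition for symmetric matrices

for c, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, c * v))   # True: A v is a scalar multiple of v

# The determinant is the product of the eigenvalues, so a zero eigenvalue
# would make A singular.
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))
</syntaxhighlight>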

The PDF Decomposition

Probability distributions have useful applications even in situations where there is no obvious data-generating process.

If f is a function such that [math]\displaystyle{ \int_D f(x)\, dx \lt \infty }[/math], then for some function g(x), we can write :[math]\displaystyle{ f(x) = g(x) p_X(x) \qquad (1.92) }[/math] where p_X is the probability density function of a random variable X with support over the relevant domain of f. The decomposition of the function f in this way is called probability density function decomposition or PDF decomposition. The PDF decomposition allows us to use methods of statistical estimation for approximations involving f. We will see examples of the PDF decomposition in various applications, such as Monte Carlo methods (pages 192 and 418), function estimation (Chapters 10 and 15), and projection pursuit (Chapter 16).
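
A minimal sketch of how the PDF decomposition turns an integral into an expectation that can be estimated by a sample mean (the integrand and the distribution are chosen for illustration, not taken from the book):

<syntaxhighlight lang="python">
# Illustrative sketch only: f(x) = g(x) p_X(x) makes the integral of f equal to E(g(X)).
import numpy as np

rng = np.random.default_rng(1)

# f(x) = x^2 exp(-x) on (0, inf); take p_X = Exp(1) density, so g(x) = x^2.
# The exact value of the integral is Gamma(3) = 2.
x = rng.exponential(scale=1.0, size=100_000)
estimate = np.mean(x**2)
print(estimate)   # close to 2
</syntaxhighlight>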

1.4 Statistical Inference

For statistical inference, we generally assume that we have a sample of observations Y_1, ..., Y_n on a random variable Y. A random sample, which we will usually just call a “sample”, is a set of independent and identically distributed (i.i.d.) random variables. We will often use Y to denote a random sample on the random variable Y. (This may sound confusing, but it is always clear from the context.) A statistic is any function of Y that does not involve any unobservable values. We may denote the actual observed values as [math]\displaystyle{ y_1, \ldots, y_n }[/math] since they are not random variables.

We assume that the sample arose from some data-generating process or, equivalently, as a random sample from a probability distribution. Our objective is to use the sample to make inferences about the process. We may assume that the specific process, call it P_θ, is a member of some family of probability distributions P. For statistical inference, we fully specify the family P (it can be a very large family), but we assume some aspects of P_θ are unknown. (If the distribution P_θ that yielded the sample is fully known, while there may be some interesting questions about probability, there are no interesting statistical questions.) Our objective in statistical inference is to determine a specific P_θ ∈ P, or some subfamily P_θ ⊂ P, that could likely have generated the sample.

The distribution may also depend on other observable variables. In general, we assume we have observations Y_1, ..., Y_n on Y, together with associated observations on any related variable X or x. We denote the observed values as (y_1, x_1), ..., (y_n, x_n), or just as y_1, ..., y_n. In this context, a statistic is any function that does not involve any unobserved values.

In statistical inference, we distinguish observable random variables and “parameters”, but we are not always careful in referring to parameters. We think of two kinds of parameters: “known” and “unknown”. A statistic is a function of observable random variables that does not involve any unknown parameters. The “θ” in the expression P_θ above may be a parameter, perhaps a vector, or it may just be some index that identifies the distribution within a set of possible distributions.

Types of Statistical Inference

There are three different types of inference related to the problem of determining the specific [math]\displaystyle{ P_θ \in \mathcal{P} }[/math]: point estimation, hypothesis tests, and confidence sets. In point estimation, the estimand is often some function of the basic parameter θ. We often denote the estimand in general as g(θ). Hypothesis tests and confidence sets are associated with probability statements that depend on Pθ. We will briefly discuss them in Section 1.5.

In parametric settings, each type of inference concerns a parameter, θ, that is assumed to be in some parameter space, Θ ⊂ IR^k. If Θ is not a closed set, it is more convenient to consider the closure of Θ, denoted by [math]\displaystyle{ \bar{\Theta} }[/math], because sometimes a good estimator may actually be on the boundary of the open set Θ. (If Θ is closed, [math]\displaystyle{ \bar{\Theta} }[/math] is the same set, so we can always just consider [math]\displaystyle{ \bar{\Theta} }[/math].)

A related problem in estimation is prediction, in which the objective is to estimate the expected value of a random variable, given some information about past realizations of the random variable and possibly, [[covariates associated with those realizations]].

Performance of Statistical Methods for Inference

There are many properties of a method of statistical inference that may be relevant. In the case of point estimation of a function of the parameter θ, for example, we may use an estimator T(Y) based on the sample Y. Relevant properties of T(Y) include its bias, its variance, and its mean squared error.

The bias of T(Y) for g(θ) is [math]\displaystyle{ \mathrm{Bias}(T, g(\theta)) = \mathrm{E}(T(Y)) - g(\theta). \qquad (1.93) }[/math] When it is clear what is being estimated, we often use the simpler notation Bias(T).

If this quantity is 0, then T(Y) is said to be unbiased for g(θ). The mean squared error (MSE) of T(Y) is …

Five Approaches to Statistical Inference

If we assume that we have a random sample of observations Y_1, ..., Y_n on a random variable Y from some distribution [math]\displaystyle{ P_\theta }[/math], which is a member of some family of probability distributions [math]\displaystyle{ \mathcal{P} }[/math], our objective in statistical inference is to determine a specific [math]\displaystyle{ P_\theta \in \mathcal{P} }[/math], or some subfamily [math]\displaystyle{ P_\theta \subset \mathcal{P} }[/math], that could likely have generated the sample.

Five approaches to statistical inference are

These approaches are not mutually exclusive.

The computational issues in statistical inference are varied. In most approaches an optimization problem is central, and many of the optimization problems cannot be solved analytically. Some approaches, such as those using the ECDF, lead to computationally intensive methods that use simulated datasets. The use of a loss function may yield very complicated integrals representing average loss. These cannot be evaluated analytically, and so are often evaluated using Monte Carlo methods.

We will discuss use of the ECDF more fully in Section 1.7, and in the rest of this section, we will briefly discuss the other approaches listed above. In Exercise 1.21 you are asked to obtain various estimates based on these approaches. You should pay particular attention to the specific model or assumptions that underlie each approach.

A Decision-Theoretic Approach; Loss and Risk

In the decision-theoretic approach to statistical inference, we call the inference a decision or an action, and we identify a cost or loss that depends on the decision and the true (but unknown) state of nature modeled by P ∈ P.

Obviously, we try to take an action that minimizes the expected loss.

We call the set of allowable actions or decisions the action space or decision space, and we denote it as A. We base the inference on the random variable X; hence, the decision is a mapping from [math]\displaystyle{ \mathcal{X} }[/math], the range of X, to A.

If we observe data X, we take the action T(X) = a ∈ A. The statistical procedure that leads to T(·) is the decision rule.

Loss Function

A loss function, L, is a mapping from P × A to [0, ∞). The value of the function at a given distribution P ∈ P for the action a is L(P, a). If P is indexed by θ, we can write the value of the function at a given value θ for the action a as L(θ, a).

Depending on the parameter space Θ, the action space A, and our objectives, the loss function often is a function only of g(θ)−a; that is, we may have L(θ, a) = L(g(θ)−a).

The loss function generally should be nondecreasing in [math]\displaystyle{ \mid g(\theta) - a \mid }[/math]. A loss function that is convex has nice mathematical properties. A particularly nice loss function, which is strictly convex, is the “squared-error loss”: :[math]\displaystyle{ L_2(\theta, a) = (g(\theta) - a)^2. \qquad (1.99) }[/math]

Any strictly convex loss function over an unbounded interval is unbounded. It is not always realistic to use an unbounded loss function. A common bounded loss function is the 0-1 loss function, which may be

[math]\displaystyle{ L_{0-1}(\theta, a) = \begin{cases} 0 & \text{if } \mid g(\theta) - a \mid \le \alpha(n) \\ 1 & \text{otherwise.} \end{cases} }[/math]

Risk Function

To choose an action rule T so as to minimize the loss function is not a well-defined problem. We can make the problem somewhat more precise by considering the expected loss based on the action T(X), which we define to be the risk:

[math]\displaystyle{ R(P, T) = \mathrm{E}\left(L(P, T(X))\right). \qquad (1.100) }[/math]

The problem may still not be well defined. For example, to estimate g(θ) so as to minimize the risk function is still not a well-defined problem. We can make the problem precise either by imposing additional restrictions on the estimator or by specifying in what manner we want to minimize the risk.
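
A hedged illustration of loss and risk (the estimators, sample size, and distributions are invented for illustration): the risk of two decision rules under squared-error loss, estimated by simulation.

<syntaxhighlight lang="python">
# Illustrative sketch only: estimate R(P, T) = E(L(P, T(X))) under squared-error
# loss for two decision rules (sample mean, sample median) and two distributions.
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta = 25, 20_000, 0.0

def risk(sampler, estimator):
    losses = [(estimator(sampler(n)) - theta) ** 2 for _ in range(reps)]
    return np.mean(losses)

normal = lambda m: rng.normal(loc=theta, size=m)
cauchy = lambda m: theta + rng.standard_cauchy(size=m)

print(risk(normal, np.mean), risk(normal, np.median))  # mean has smaller risk under normality
print(risk(cauchy, np.mean), risk(cauchy, np.median))  # median has smaller risk under heavy tails
</syntaxhighlight>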

Optimal Decision Rules

We compare decision rules based on their risk with respect to a given loss function and a given family of distributions. If a decision rule T∗ has the property

[math]\displaystyle{ R(P, T^*) \le R(P, T) \quad \forall P \in \mathcal{P}, \qquad (1.101) }[/math]

for all T, then T∗ is called an optimal decision rule.

Approaches to Minimizing the Risk

We use the principle of minimum risk in the following restricted ways. In all cases, the approaches depend, among other things, on a given loss function.

  • We may first place a restriction on the estimator and then minimize risk subject to that restriction. For example, we may …

Likelihood

Given a sample Y_1, ..., Y_n from distributions with probability densities p_i(y), where all PDFs are defined with respect to a common σ-finite measure, the likelihood function is

:[math]\displaystyle{ L_n(p_i ; Y) = \prod_{i=1}^n p_i(Y_i). \qquad (1.107) }[/math]

(Any nonnegative function proportional to L_n(p_i ; Y) is a likelihood function, but it is common to speak of L_n(p_i ; Y) as “the” likelihood function.) We can view the sample either as a set of random variables or as a set of constants, the realized values of the random variables. Thinking of the likelihood as a function of realized values, we usually use lower-case letters.

The log-likelihood function is the log of the likelihood: :[math]\displaystyle{ l_{L_n}(p_i ; y) = \log L_n(p_i ; y). \qquad (1.108) }[/math] It is a sum rather than a product.

The n subscript serves to remind us of the sample size, and this is often very important in use of the likelihood or log-likelihood function particularly because of their asymptotic properties. We often drop the n subscript, however. Also, we often drop the L subscript on the log-likelihood. (I should also mention that some authors use the upper and lower cases in the opposite way from that given above.)

In many cases of interest, the sample is from a single parametric family. If the PDF is p(y ; θ) then the likelihood and log-likelihood functions are written as :[math]\displaystyle{ L(\theta; y) = \prod_{i=1}^n p(y_i ; \theta), \qquad (1.109) }[/math] and :[math]\displaystyle{ l(\theta; y) = \log L(\theta; y). \qquad (1.110) }[/math]

We sometimes write the expression for the likelihood without the observations: L(θ) or l(θ).
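
A minimal sketch of the likelihood viewed as a function of the parameter with the sample held fixed (the model and data are invented for illustration):

<syntaxhighlight lang="python">
# Illustrative sketch only: the log-likelihood l(t; y) for a N(t, 1) model,
# evaluated as a function of the variable t with the sample fixed.
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=5.0, scale=1.0, size=50)   # "observed" sample

def log_likelihood(t, y):
    # sum of log densities of N(t, 1) at the observations
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - t) ** 2)

grid = np.linspace(3.0, 7.0, 401)
values = np.array([log_likelihood(t, y) for t in grid])
print(grid[np.argmax(values)], y.mean())   # maximizer on the grid is near the sample mean
</syntaxhighlight>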

The Parameter Is the Variable

Note that the likelihood is a function of θ for a given y, while the PDF is a function of y for a given θ.

If we think of θ as a fixed, but unknown, value, it does not make sense to think of a function of that particular value, and if we have an expression in terms of that value, it does not make sense to perform operations such as differentiation with respect to that quantity. We should think of the likelihood as a function of some dummy variable t, and write L(t; y) or l(t; y).

The likelihood function arises from a probability density, but it is not a probability density function. It does not in any way relate to a “probability” associated with the parameters or the model.

Although non-statisticians will often refer to the “likelihood of an observation”, in statistics, we use the term “likelihood” to refer to a model or a distribution given observations.

In a multiparameter case, we may be interested in only some of the parameters. There are two ways of approaching this, use of a profile likelihood or of a conditional likelihood.

Let [math]\displaystyle{ θ= (θ_1, θ_2) }[/math]. If θ_2 is fixed, the likelihood L(θ_1 ;θ_2, y) is called a profile likelihood or concentrated likelihood of θ_1 for given θ_2 and y.

If the PDFs can be factored so that one factor includes [math]\displaystyle{ θ_2 }[/math] and some function of the sample, [math]\displaystyle{ S(y) }[/math], and the other factor, given [math]\displaystyle{ S(y) }[/math], is free of [math]\displaystyle{ θ_2 }[/math], then this factorization can be carried into the likelihood. Such a likelihood is called a conditional likelihood of [math]\displaystyle{ θ_1 }[/math] given [math]\displaystyle{ S(y) }[/math].

Maximum Likelihood Estimation

The maximum likelihood estimate (MLE) of [math]\displaystyle{ θ }[/math], [math]\displaystyle{ \hat{θ} }[/math], is defined as :[math]\displaystyle{ \hat{\theta} = \arg\max_{t \in \bar{\Theta}} L(t; y), \qquad (1.111) }[/math] where [math]\displaystyle{ \bar{\Theta} }[/math] is the closure of the parameter space.

The MLE in general is not unbiased for its estimand. A simple example is the MLE of the variance in a normal distribution with unknown mean. If Y_1, ..., Y_n are i.i.d. N(μ, σ^2), it is easy to see from the definition (1.111) that the MLE of [math]\displaystyle{ \sigma^2 }[/math], that is, of the [[variance]] of [math]\displaystyle{ Y }[/math], is

:[math]\displaystyle{ \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2. \qquad (1.112) }[/math]

Thus the MLE is (n−1)S^2/n, where S^2 is the usual sample variance: :[math]\displaystyle{ S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2. \qquad (1.113) }[/math]

Notice that the MLE of the variance depends on the distribution. (See Exercise 1.16d.)

The MLE may have smaller MSE than an unbiased estimator, and, in fact, that is the case for the MLE of σ^2 in the case of a normal distribution with unknown mean compared with the estimator S^2 of σ^2.
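
A small simulation consistent with this claim (the sample size and σ^2 are invented for illustration); it estimates the MSE of S^2 and of the MLE (n−1)S^2/n for normal data:

<syntaxhighlight lang="python">
# Illustrative sketch only: Monte Carlo comparison of the MSE of S^2 and of the
# MLE of sigma^2 for i.i.d. N(mu, sigma^2) data with mu unknown.
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, reps = 10, 4.0, 50_000

s2 = np.array([np.var(rng.normal(scale=np.sqrt(sigma2), size=n), ddof=1)
               for _ in range(reps)])
mle = (n - 1) / n * s2

print(np.mean((s2  - sigma2) ** 2))   # MSE of S^2
print(np.mean((mle - sigma2) ** 2))   # smaller MSE, despite the bias of the MLE
</syntaxhighlight>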

We will discuss statistical properties of maximum likelihood estimation beginning on page 70, and some of the computational issues of MLE in Chapter 6.

Score Function

In statistical inference, we often use the information about how the likelihood or log-likelihood would vary if θ were to change. (As we have indicated, “θ” sometimes plays multiple roles. I like to think of it as a fixed but unknown value and use “t” or some other symbol for variables that can take on different values. Statisticians, however, often use the same symbol to represent something that might change.) For a likelihood function (and hence, a log-likelihood function) that is differentiable with respect to the parameter, a function that represents this change and plays an important role in statistical inference is the score function: :[math]\displaystyle{ s_n(\theta; y) = \frac{\partial l(\theta; y)}{\partial \theta}. \qquad (1.114) }[/math]

Likelihood Equation

In statistical estimation, if there is a point at which the likelihood attains its maximum (which is, of course, the same point at which the log-likelihood attains its maximum) that point obviously is of interest; it is the MLE in equation (1.111).

If the likelihood is differentiable with respect to the parameter, the roots of the score function are of interest whether or not they correspond to MLEs. The score function equated to zero, :[math]\displaystyle{ \frac{\partial l(\theta; y)}{\partial \theta} = 0, \qquad (1.115) }[/math] is called the likelihood equation. The derivative of the likelihood equated to zero, ∂L(θ; y)/∂θ = 0, is also called the likelihood equation. Equation (1.115) is an estimating equation; that is, its solution, if it exists, is an estimator. Note that it is not necessarily an MLE; it is a root of the likelihood equation, or RLE.

It is often useful to define an estimator as the solution of some estimating equation. We will see other examples of estimating equations in subsequent sections.
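
A hedged sketch of an estimating equation in practice (the exponential model is chosen for illustration): the root of the likelihood equation is found numerically and compared with its closed form, 1/ȳ.

<syntaxhighlight lang="python">
# Illustrative sketch only: for an i.i.d. exponential sample with rate theta,
# the score is s_n(theta; y) = n/theta - sum(y); the root of the likelihood
# equation (1.115) is 1/ybar.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
y = rng.exponential(scale=1 / 2.5, size=200)   # true rate 2.5

def score(theta, y):
    return len(y) / theta - np.sum(y)

rle = brentq(score, a=1e-6, b=100.0, args=(y,))  # root of the likelihood equation
print(rle, 1 / np.mean(y))                       # the two agree
</syntaxhighlight>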

Likelihood Ratio

When we consider two different possible distributions for a sample y, we have two different likelihoods, say L_0 and L_1. (Note the potential problems in interpreting the subscripts; here the subscripts refer to the two different distributions. For example, L_0 may refer to L(θ_0 ; y) in a notation consistent with that used above.) In this case, it may be of interest to compare the two likelihoods in order to make an inference about the two possible distributions. A simple comparison, of course, is the ratio, and indeed

:[math]\displaystyle{ \frac{L(\theta_0 ; y)}{L(\theta_1 ; y)}, \qquad (1.116) }[/math] or L_0/L_1 in the simpler notation, is called the likelihood ratio with respect to the two possible distributions. Although in most contexts we consider the likelihood to be a function of the parameter for given, fixed values of the observations, it may also be useful to consider the likelihood ratio to be a function of y.

The most important use of the likelihood ratio is as the basis for statistical tests that are constructed following the Neyman-Pearson lemma for a simple null hypothesis versus a simple alternative hypothesis (see page 53). If the likelihood is monotone in θ_1, we can extend the simple hypotheses of the Neyman-Pearson lemma to certain composite hypotheses. Thus, a monotone likelihood ratio is an important property of a distribution.

The likelihood ratio, or the log of the likelihood ratio, plays an important role in statistical inference. Given the data y, the log of the likelihood ratio is called the support of the hypothesis that the data came from the population that would yield the likelihood L0 versus the hypothesis that the data came from the population that would yield the likelihood L1. The support of the hypothesis clearly depends on both L0 and L1, and it ranges over IR. The support is also called the experimental support.

Likelihood Principle

The likelihood principle in statistical inference asserts that all of the information which the data provide concerning the relative merits of two hypotheses (two possible distributions that give rise to the data) is contained in the likelihood ratio of those hypotheses and the data. An alternative statement of the likelihood principle is that, if for x and y, :[math]\displaystyle{ \frac{L(\theta; x)}{L(\theta; y)} = c(x, y) \quad \forall \theta, }[/math] where c(x, y) is constant for given x and y, then any inference about θ based on x should be in agreement with any inference about θ based on y.

Fitting Expected Values

Given a random sample Y_1, ..., Y_n from distributions with probability densities p_{Y_i}(y_i ; θ), where all PDFs are defined with respect to a common σ-finite measure, if we have that E(Y_i) = g_i(θ), then a reasonable approach to estimation of θ may be to choose a value [math]\displaystyle{ \hat{\theta} }[/math] that makes the differences E(Y_i) − g_i(θ) close to zero.

We must define the sense in which the differences are close to zero. A simple way to do this is to define a nonnegative scalar-valued function of scalars, ρ(u, v), that is increasing in the absolute difference of its arguments. We then define :[math]\displaystyle{ S(\theta, y) = \sum_{i=1}^n \rho(y_i, g_i(\theta)), \qquad (1.117) }[/math] and a reasonable estimator is :[math]\displaystyle{ \hat{\theta} = \arg\min_{t \in \Theta} S(t, y). \qquad (1.118) }[/math]

One simple choice for the function is ρ(u, v) = (u − v)^2. In this case, the estimator is called the least squares estimator. Another choice, which is more difficult mathematically, is [math]\displaystyle{ \rho(u, v) = \mid u - v \mid }[/math]. In this case, the estimator is called the least absolute values estimator.
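
A minimal sketch (assuming the i.i.d. case g_i(θ) = θ; the data are invented for illustration) that obtains both estimators by direct numerical minimization of S(t, y):

<syntaxhighlight lang="python">
# Illustrative sketch only: minimize S(t, y) = sum_i rho(y_i, t) for the two
# choices of rho; the minimizers are close to the sample mean and sample median.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
y = rng.normal(loc=3.0, size=101)

ls  = minimize_scalar(lambda t: np.sum((y - t) ** 2)).x      # least squares
lav = minimize_scalar(lambda t: np.sum(np.abs(y - t))).x     # least absolute values

print(ls,  y.mean())      # least squares estimate is close to the sample mean
print(lav, np.median(y))  # least absolute values estimate is close to the sample median
</syntaxhighlight>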

Compare the minimum residual estimator in equation (1.118) with the maximum likelihood estimate of θ, defined in equation (1.111).

If the Y_i are i.i.d., then all g_i(θ) are the same, say g(θ).

In common applications, we have covariates, Z_1, ..., Z_n, and the E(Y_i) have a constant form that depends on the covariate: E(Y_i) = g(Z_i, θ).

As with solving the maximization of the likelihood, the solution to the minimization problem (1.118) may be obtained by solving

[math]\displaystyle{ \frac{\partial S(\theta; y)}{\partial \theta} = 0. \qquad (1.119) }[/math]

Like equation (1.115), equation (1.119) is an estimating equation; that is, its solution, if it exists, is an estimator. There may be various complications, of course; for example, there may be multiple roots of (1.119).

1.5 Probability Statements in Statistical Inference

There are two important instances in statistical inference in which statements about probability are associated with the decisions of the inferential methods. In hypothesis testing, under assumptions about the distributions, we base our inferential methods on probabilities of two types of errors. In confidence intervals the decisions are associated with probability statements about coverage of the parameters. For both cases the probability statements are based on the distribution of a random sample, Y_1, ..., Y_n.

In computational inference, probabilities associated with hypothesis tests or confidence intervals are estimated by simulation of a hypothesized data-generating process or by resampling of an observed sample.
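
A hedged sketch of this kind of computational inference (the test, sample size, and distribution are chosen for illustration): the Type I error probability of a nominal 5% t-test estimated by simulating the hypothesized data-generating process.

<syntaxhighlight lang="python">
# Illustrative sketch only: estimate by simulation the probability of a Type I
# error of a nominal 5% one-sample t-test when H0 (here N(0,1) data) is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps, alpha = 20, 10_000, 0.05

rejections = 0
for _ in range(reps):
    y = rng.normal(loc=0.0, size=n)                 # H0 is true
    _, pvalue = stats.ttest_1samp(y, popmean=0.0)
    rejections += (pvalue < alpha)

print(rejections / reps)   # close to the nominal 0.05
</syntaxhighlight>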

Tests of Hypotheses

Often statistical inference involves testing a “null” hypothesis, H_0, about the parameter. In a simple case, for example, we may test the hypothesis H_0: θ = θ_0 versus an alternative hypothesis H_1: θ = θ_1.

We do not know which hypothesis is true, but we want a statistical test that has a very small probability of rejecting the null hypothesis when it is true and a high probability of rejecting it when it is false. There is a tradeoff between these two decisions, so we will put an upper bound on the probability of rejecting the null hypothesis when it is true (called a “Type I error”), and under that constraint, seek a procedure that minimizes the probability of the other type of error (“Type II”). To be able to bound either of these probabilities, we must know (or make assumptions about) the true underlying data-generating process.

Thinking of the hypotheses in terms of a parameter θ that indexes these two densities by θ_0 and θ_1, for a sample X = x, we have the likelihoods associated with the two hypotheses as L(θ_0; x) and L(θ_1; x). We may be able to define an α-level critical region for nonrandomized tests in terms of the …

Estimators of the form of linear combinations of order statistics, such as the Harrell-Davis or Kaigh-Lachenbruch quantile estimators, are called “L statistics”. In Exercise 1.18 you are asked to study the relative performance of the sample median and the Harrell-Davis estimator as estimators of the population median.
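
A small simulation in the spirit of that exercise (it assumes the Harrell-Davis quantile estimator is available as scipy.stats.mstats.hdquantiles; the sample size and distribution are invented):

<syntaxhighlight lang="python">
# Illustrative sketch only: Monte Carlo comparison of the sample median and the
# Harrell-Davis estimator of the population median.
import numpy as np
from scipy.stats.mstats import hdquantiles

rng = np.random.default_rng(8)
n, reps, true_median = 25, 5_000, 0.0

med, hd = [], []
for _ in range(reps):
    y = rng.normal(size=n)
    med.append(np.median(y))
    hd.append(float(hdquantiles(y, prob=[0.5])[0]))

print(np.mean((np.array(med) - true_median) ** 2))  # MSE of the sample median
print(np.mean((np.array(hd)  - true_median) ** 2))  # MSE of the Harrell-Davis estimator
</syntaxhighlight>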

1.8 The Role of Optimization in Inference

Important classes of estimators are defined as points at which some function that involves the parameter and the random variable achieves an optimum with respect to a variable in the role of the parameter in the function. There are, of course, many functions that involve the parameter and the random variable. One example of such a function is the probability density function itself, and as we have seen optimization of this function is the idea behind maximum likelihood estimation.

In the use of function optimization in inference, once the objective function is chosen, observations on the random variable are taken and are then considered to be fixed; the parameter in the function is considered to be a variable (the “decision variable”, in the parlance often used in the literature on optimization). The function is then optimized with respect to the parameter variable. The nature of the function determines the meaning of “optimized”; if the function is the probability density, for example, “optimized” would logically mean “maximized”, which leads to maximum likelihood estimation. In discussing the use of optimization in statistical estimation, we must be careful to distinguish between a symbol that represents a fixed parameter and a symbol that represents a “variable” parameter. When we denote a probability density function as [math]\displaystyle{ p(y \mid θ) }[/math], we generally expect “θ” to represent a fixed, but possibly unknown, parameter. In an estimation method that involves optimizing this function, however, θ is a variable placeholder. In the following discussion, we will generally consider a variable t in place of θ. We also use t_0, t_1, and so on to represent specific fixed values of the variable. In an iterative algorithm, we use t^{(k)} to represent a fixed value in the kth iteration. We do not always do this, however, and sometimes, as other authors do, we will use …

Estimation by Maximum Likelihood

One of the most commonly used approaches to statistical estimation is maximum likelihood. The concept has an intuitive appeal, and the estimators based on this approach have a number of desirable mathematical properties, at least for broad classes of distributions. Given a sample y_1, ..., y_n from a distribution with probability density or probability mass function [math]\displaystyle{ p(y \mid θ) }[/math], a reasonable estimate of θ is the value that maximizes the joint density or joint probability with variable t at the observed sample value:

[math]\displaystyle{ \prod_i p(y_i \mid t) }[/math]. We define the likelihood function as a function of a variable in place of the parameter: :[math]\displaystyle{ L_n(t; y) = \prod_{i=1}^n p(y_i \mid t). \qquad (1.162) }[/math]

Note the reversal in roles of variables and parameters. The likelihood function appears to represent a “posterior probability”, but, as emphasized by R. A. Fisher who made major contributions to the use of the likelihood function in inference, that is not an appropriate interpretation.

Just as in the case of estimation by minimizing residuals, the more difficult and interesting problems involve the determination of the form of the function [math]\displaystyle{ p(y_i \mid θ) }[/math]. In these sections, as above, however, we concentrate on the simpler problem of determining an appropriate value of θ, assuming that the form of p is known.

The value of t for which L attains its maximum value is the maximum likelihood estimate (MLE) of θ for the given data, y. The data — that is, the realizations of the variables in the density function — are considered as fixed, and the parameters are considered as variables of the optimization problem, :[math]\displaystyle{ \max_t \; L_n(t; y). \qquad (1.163) }[/math]

This optimization problem can be much more difficult than the optimization problem (1.155) that results from an estimation approach based on minimization of some norm of a residual vector. As we discussed in that case, there can be both computational and statistical problems associated either with restrictions on the set of possible parameter values or with the existence of local optima of the objective function. These problems also occur in maximum likelihood estimation.

Applying constraints in the optimization problem to force the solution to be within the set of possible parameter values is called restricted maximum likelihood estimation, or REML estimation. In addition to problems due to constraints or due to local optima, other problems may arise if the likelihood function is unbounded. The conceptual difficulties resulting from an unbounded likelihood are much deeper. In practice, for computing estimates in the unbounded case, the general likelihood principle may be retained, and the optimization problem redefined to include a penalty that keeps the function bounded. Adding a penalty to form a bounded objective function in the optimization problem, or to dampen the solution, is called penalized maximum likelihood estimation.

For a broad class of distributions, the maximum likelihood criterion yields estimators with good statistical properties. The conditions that guarantee certain optimality properties are called the “regular case”.

Although in practice, the functions of residuals that are minimized are almost always differentiable, and the optimum occurs at a stationary point, this is often not the case in maximum likelihood estimation. A standard example in which the MLE does not occur at a stationary point is a distribution in which the range depends on the parameter, and the simplest such distribution is the uniform U(0, θ). In this case, the MLE is the max order statistic.
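
A minimal numerical illustration of this boundary behavior (the sample is simulated and the grid search is only for illustration):

<syntaxhighlight lang="python">
# Illustrative sketch only: for a sample from U(0, theta) the likelihood is
# L(t; y) = t^{-n} for t >= max(y) (and 0 otherwise), so it is maximized at the
# boundary value max(y), not at a stationary point.
import numpy as np

rng = np.random.default_rng(9)
theta = 3.0
y = rng.uniform(0.0, theta, size=50)

def likelihood(t, y):
    return t ** (-len(y)) if t >= y.max() else 0.0

grid = np.linspace(0.1, 5.0, 2000)
values = [likelihood(t, y) for t in grid]
print(grid[int(np.argmax(values))], y.max())   # maximizer on the grid sits at max(y)
</syntaxhighlight>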

Maximum likelihood estimation is particularly straightforward for distributions in the exponential class, that is, those with PDFs of the form in equation (1.89) on page 35. Whenever Y does not depend on θ, and η(·) and ξ(·) are sufficiently smooth, the MLE has certain optimal statistical properties. This family of probability distributions includes many of the familiar distributions, such as the normal, the binomial, the Poisson, the gamma, the Pareto, and the negative binomial.

The log-likelihood function, :[math]\displaystyle{ l_{L_n}(\theta; y) = \log L_n(\theta; y), \qquad (1.164) }[/math] is a sum rather than a product. The form of the log-likelihood in the exponential family is particularly simple: …

Introduction to Part II - Statistical Computing

The terms “computational statistics” and “statistical computing” are sometimes used interchangeably. The latter term, however, is often used more specifically to refer to the actual computations, both numerical and nonnumerical. The emphasis of Part II is on the computations themselves.

Statistical computing includes relevant areas of numerical analysis, the most important of which are computer number systems, algorithms and programming, function approximation and numerical quadrature, numerical linear algebra, solution of nonlinear equations and optimization, and generation of random numbers. These topics are the subjects of the individual chapters of Part II.

No matter at what level of detail a statistician needs to be familiar with the computational topics of this part, there are two simple, higher-level facts all statisticians should be aware of:

Computer numbers are not the same as real numbers, and the arithmetic operations on computer numbers are not exactly the same as those of ordinary arithmetic.

and

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

These statements appear word for word in several places in this book. The material in Chapter 2 illustrates the first statement in some detail. As for the second statement, the difference between an expression and a computing method can easily be illustrated by the problem of obtaining the solution to the linear system of equations Ax = b. Assuming A is square and of full rank, the solution can be written as A^{−1}b. This is a simple expression, and it is certainly appropriate to use it to denote the solution. This expression may imply that to solve the linear system, we first determine A^{−1} and then multiply it by b on the right to obtain A^{−1}b. This is not the way to obtain the solution. In Chapter 5 we will describe how the solution should be obtained. It does not involve inverting the matrix. This is just one example of how a convenient form of a mathematical expression and the way the expression should be evaluated may be different.
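
A hedged sketch of the same point (the example system is random and the sizes are arbitrary): solving Ax = b with a factorization-based solver rather than by forming A^{−1}.

<syntaxhighlight lang="python">
# Illustrative sketch only: the solution of Ax = b may be written A^{-1} b,
# but it should be computed with a solver, not by forming the inverse.
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(500, 500))
b = rng.normal(size=500)

x_solve = np.linalg.solve(A, b)        # factor-and-solve: the preferred computation
x_inv   = np.linalg.inv(A) @ b         # the literal expression A^{-1} b: slower, less accurate

print(np.allclose(x_solve, x_inv))     # same answer here, but not in ill-conditioned cases
</syntaxhighlight>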

Statistical computations, while motivated by computations in the field of real numbers, IR, do not in actual practice conform to the rules of arithmetic in a field. (A field is a mathematical structure consisting of a set and two closed, associative, and commutative operations on it, usually called “addition”, which has an additive identity and for which each element has an additive inverse, and “multiplication”, which has a multiplicative identity and for which each element except the additive identity has a multiplicative inverse, and such that multiplication distributes over addition.) In both the field IR and the computer number arithmetic system, which we will denote as IF, there are two basic operations, and the arithmetic operations in the computer simulate those in IR. A computer engineer may identify a different set of “basic” operations, but those differences are not relevant for our purposes here; the essential facts are that addition on the computer simulates addition in the numbers of interest, IR, and multiplication on the computer simulates multiplication in IR. The result of an arithmetic operation in the computer may not yield the same value as the operation that it simulates. Furthermore, the two important properties of arithmetic in IR, which are common to all fields, that is, associativity of both operations and distributivity of multiplication over addition, do not hold in computer operations. These facts are very significant for statistical computing. The mathematical properties of the two structures IR and IF are important, and they are essential to the elements within each structure. In Chapter 2 we describe standards that computer arithmetic must follow. In these standards there are six basic operations, and the standard requires that each of these operations be correct to within rounding. (Note that the exceptions mentioned above involve more than one operation.)

How much a computer user needs to know about the way the computer works depends on the complexity of the use and the extent to which the necessary operations of the computer have been encapsulated in software that is oriented toward the specific application. Although some of the details we discuss will not be important for the computational scientist or for someone doing routine statistical computing, the consequences of those details are important, and the serious computer user must be at least vaguely aware of the consequences. The fact that multiplying two positive integers on the computer can yield a negative number should cause anyone who programs a computer to take care.
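
Two small illustrations of these facts (the particular values are chosen only to expose the behavior): floating-point addition is not associative, and fixed-width integer multiplication can wrap around to a negative value.

<syntaxhighlight lang="python">
# Illustrative sketch only: two concrete consequences of the statements above.
import numpy as np

# Floating-point addition is not associative.
a, b, c = 1.0e16, -1.0e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0

# Fixed-width (32-bit) integer multiplication can overflow: two positive
# integers can yield a negative product because the result wraps around.
x = np.array([2_000_000_000], dtype=np.int32)
print(x * 2)         # a negative number, not 4_000_000_000
</syntaxhighlight>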

We next address, in Chapter 3, some basic issues related to computations, such as algorithm/data interaction, programming principles and so on. After these two general chapters, the next four chapters address the numerical analysis for the four main classes of problems alluded to above.

Chapter 2: Computer Storage and Arithmetic

Data represent information at various levels. The form of data, whether numbers, characters, or picture elements, provides different perspectives. Data of whatever form are represented by groups of 0s and 1s, called bits from the words “binary” and “digits”. (The word was coined by John Tukey.) For representing simple text (that is, strings of characters with no special representation), the bits are usually taken in groups of eight, called bytes, and associated with characters according to a coding rule. Because of the common association of a byte with a character, those two words are often used synonymously.

For representing characters in bytes, “ASCII” (pronounced “askey”, from American Standard Code for Information Interchange) is a code widely used. At first only English letters, Arabic numerals, and a few marks of punctuation had codes. Gradually over time more and more symbols were given codified representations. Also, because the common character sets differ from one language to another (both natural languages and computer languages), there are several modifications of the basic ASCII code set. When there is a need for more different characters than can be represented in a byte (2^8), codes to associate characters with larger groups of bits are necessary. For compatibility with the commonly used ASCII codes using groups of 8 bits, these codes usually are for groups of 16 bits. These codes for “16-bit characters” are useful for representing characters in some Oriental languages, for example. The Unicode Consortium has developed a 16-bit standard, called Unicode, that is widely used for representing characters from a variety of languages. For any ASCII character, the Unicode representation uses eight leading 0s and then the same eight bits as the ASCII representation.

An important consideration in the choice of a method to represent data is the way data are communicated within a computer and between the computer and peripheral components such as data storage units. Data are usually treated as a fixed-length sequence of bits. The basic grouping of bits in a computer is sometimes called a “word” or a “storage unit”. The lengths of words or storage units commonly used in computers are 32 or 64 bits.

Chapter 3. Algorithms and programming.

We will use the term “algorithm” rather loosely but always in the general sense of a method or a set of instructions for doing something. Formally, an “algorithm” must terminate. Sometimes we may describe an algorithm that may not terminate simply by following the steps in our description. Whether we expressly say so or not, there should always be a check on the number of steps, and the algorithm should terminate after some large number of steps no matter what. Algorithms are sometimes distinguished as “numerical”, “seminumerical”, and “nonnumerical”, depending on the extent to which operations on real numbers are simulated.

Algorithms and Programs

Algorithms are expressed by means of a flowchart, a series of steps, or in a computer language or pseudolanguage. The expression in a computer language is a source program or module; hence, we sometimes use the words “algorithm” and “program” synonymously.

The program is the set of computer instructions that implement the algorithm. A poor implementation can render a good algorithm useless. A good implementation will preserve the algorithm’s accuracy and efficiency, and will detect data that are inappropriate for the algorithm. A robust algorithm is applicable over a wide range of data to which it is applied. A robust program, which is more important, is one that will detect input data that are inappropriate either for the algorithm or for the implementation in the given program.

The exact way an algorithm is implemented in a program depends of course on the programming language, but it also may depend on the computer and associated system software. A program that will run on most systems without modification is said to be portable, and this is an important property because most useful programs will be run on a variety of platforms.

The two most important aspects of a computer algorithm are its accuracy and its efficiency. Although each of these concepts appears rather simple on the surface, each is actually fairly complicated, as we shall see.

Data Structures

The efficiency of a program can be greatly affected by the type of structures used to store and operate on the data. As the size and complexity of the problem increase, the importance of appropriate database structures increases. In some data-intensive problems, the data structure can be the single most important determinant of the overall efficiency of a program. The data structure is not just a method of organization of data; it also can identify appropriate algorithms for addressing the problem.

In many applications the data are organized naturally as a list, which is a simple linearly ordered structure. The two main types of lists are stacks, in which data elements are added and removed from the same end (last in first out, LIFO), and queues, in which new data elements are added to one end and elements already in the list are removed from the other end (first in first out, FIFO). In many cases the space allocated to a given list is limited a priori, so as the list grows, another region of storage must be used. This results in a linked list, in which each list except one must contain, in addition to its data elements, a link or pointer to the next list. In the extreme case of this structure, each sublist contains only one piece of data and the link to the next sublist.
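
A minimal sketch of the two main kinds of lists (illustrative only): a Python list used as a stack (LIFO) and collections.deque used as a queue (FIFO).

<syntaxhighlight lang="python">
# Illustrative sketch only: the two main types of lists described above.
from collections import deque

stack = []                      # stack: add and remove from the same end
stack.append("a"); stack.append("b"); stack.append("c")
print(stack.pop())              # "c" -- last in, first out

queue = deque()                 # queue: add at one end, remove from the other
queue.append("a"); queue.append("b"); queue.append("c")
print(queue.popleft())          # "a" -- first in, first out
</syntaxhighlight>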

The next most basic data structure is a tree, which is a finite set whose elements (called “nodes”) consist of a special element called a “root” and, if there is more than one element, a partition of the remaining elements such that each member of the partition is a tree. If there are no remaining elements, that is, if the tree contains only one element, which by definition is a root, that element is also called a “leaf”. Many problems in modeling, classification, and clustering require a tree data structure (see Chapter 16).

A generalization of the tree is a graph, in which the nodes are usually called “vertices”, and there is no fixed method of partitioning. The other type of component of this structure consists of connections between pairs of vertices, called “edges”. Two vertices may be connected symmetrically, connected asymmetrically, or not connected. Graphs are useful in statistical applications primarily for identification of an appropriate method for addressing a given problem.

There are many variations of these basic structures, including special types of trees and graphs (binary trees, heaps, directed graphs, and so on). An important problem in statistical computing may be another aspect of the data organization, one that relates to the hardware resources. There are various types of storage in the computer, and how fast the data can be accessed in the various types may affect the efficiency of a program. We will not consider these issues in any detail in this book.

Chapter 4. Approximation of functions and numerical quadrature.

Often in applied mathematics, we encounter a functional model that we express as

[math]\displaystyle{ y = f(x), }[/math]

but yet we do not know f fully, or else we cannot evaluate it in closed form.

We may only have some general knowledge about the shape of f and know its value at certain points, [math]\displaystyle{ f(x_1), f(x_2), \ldots }[/math]. In that case, our objective may be just to determine some “nice” function, say a polynomial, that goes through those known values. We say it “interpolates” those points.

Even if we know the values at certain points, we may take a slightly different approach in which we determine a nice function that goes “close” to the points, but does not necessarily interpolate them. We say it “smoothes” the points. In this case there is an adjustment term at each of the points that is not interpolated.

Alternatively, the function may be parameterized, and we have full knowledge of the function except for the values of the parameters; that is, we form the model as [math]\displaystyle{ y = f(x;\theta), }[/math] and we know f(x;θ) except for θ. If there is a value of θ so that the function interpolates all known values, that value provides us full knowledge of f (conditional on the known values). It may be the case that there is no value of θ such that f(x;θ) fits all of the known values [math]\displaystyle{ (x_i, y_i) }[/math]. In this case we must recognize that y = f(x;θ) is not really correct; it needs some kind of adjustment term.

If our objective is to interpolate the known points, there are several ways we can do that, so we set some reasonable criteria about the form of the interpolating function, and then proceed. We will discuss ways of interpolation in this chapter.
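A hedged sketch of the idea (the points and the use of NumPy's polynomial routines are illustrative assumptions, not the book's code): a polynomial of degree n − 1 passes exactly through n distinct known values.

<pre>
import numpy as np

# Known values of the unknown function f at a few points.
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([1.0, 2.7, 7.4, 20.1])

# A polynomial of degree n - 1 interpolates n distinct points exactly.
coef = np.polyfit(x_known, y_known, deg=len(x_known) - 1)

# The interpolating polynomial reproduces the known values ...
print(np.allclose(np.polyval(coef, x_known), y_known))   # True
# ... and can be evaluated between them.
print(np.polyval(coef, 1.5))
</pre>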

If our objective is to smooth the data or if the functional model cannot interpolate all of the known points, then we must deal with an adjustment term. There are two possible approaches. One is to set some reasonable criteria about the form of the smoothing function and about what kinds of adjustment terms to allow, and then proceed. That approach is a topic of this chapter; the objective is to approximate the function. In another approach, we assume that the adjustment terms are random variables, and so we can use statistical techniques to estimate the function; that is the topic of Chapter 10.

There are many reasons for approximating a function. One reason for doing this, which we will address in later sections of this chapter, is to evaluate an integral involving the function. Another reason, which we address in Chapter 8, is to draw lines in a graphical display. An additional reason for approximating a function is to put the function in a form that is amenable for estimation; that is, we approximate a function and then estimate the approximating function.

Before proceeding to the main topic of this chapter, that is, methods of approximation of functions, we review some general topics of linear spaces developed in Section 1.2 for the special case of function spaces. This discussion leads to the notions of basis functions in Section 4.2. In Sections 4.3, 4.4, and 4.5, we discuss methods of approximation, first using a basis set of orthogonal polynomials and truncated series in those basis functions, then using finite sets of spline basis functions, and then using a kernel filter. In Sections 4.6 and 4.7 we discuss numerical quadrature. Some methods of quadrature are based on the function approximations.

Inner Products, Norms, and Metrics

The inner products, norms, and metrics that we defined beginning on page 11 are useful in working with functions, but there are a few differences that result from the fact that the “dimension” is essentially uncountable.

The inner product of functions is naturally defined in terms of integrals of the products of the functions, analogously to the definition of the dot product of vectors in equation (1.7). Just as the inner product of vectors is limited to vectors with the same length, the inner product of functions is defined for functions over some fixed range of integration (possibly infinite). …
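As an illustrative sketch (assuming SciPy's quadrature routine; the functions chosen are arbitrary), the inner product of two functions over a fixed interval can be approximated by numerically integrating their product:

<pre>
import numpy as np
from scipy.integrate import quad

# Inner product <f, g> = integral of f(x) g(x) over the fixed range [a, b].
def inner_product(f, g, a, b):
    value, _ = quad(lambda x: f(x) * g(x), a, b)
    return value

# sin and cos are orthogonal over [0, 2*pi]: their inner product is ~0.
print(inner_product(np.sin, np.cos, 0.0, 2 * np.pi))   # ~0
print(inner_product(np.sin, np.sin, 0.0, 2 * np.pi))   # ~pi
</pre>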

Chapter 5. Numerical linear algebra.

Many scientific computational problems involve vectors and matrices. It is necessary to work either with the elements of vectors and matrices individually or with the arrays themselves. Programming languages such as C provide the capabilities for working with the individual elements but not directly with the arrays. Fortran and higher-level languages such as Octave, Matlab, and R allow direct manipulation of vectors and matrices.

The distinction between the set of real numbers, IR, and the set of floating-point numbers, IF, that we use in the computer has important implications for numerical computations. An element x of a vector or matrix is approximated by [math]\displaystyle{ [x]_c }[/math], and a mathematical operation [math]\displaystyle{ \circ }[/math] is simulated by a computer operation [math]\displaystyle{ [\circ]_c }[/math]. As we emphasized in Section 2.2, the familiar laws of algebra for the field of the reals do not hold in IF.

These distinctions, of course, carry over to arrays of floating-point numbers that represent real numbers, and the mathematical properties of vectors and matrices may not hold for their computer counterparts. For example, the dot product of a nonzero vector with itself is positive, but [math]\displaystyle{ \langle x_c, x_c \rangle_c = 0 }[/math] does not imply [math]\displaystyle{ x_c = 0 }[/math]. (This is reminiscent of the considerations that led us to discuss pseudonorms on page 149, but the issues here are entirely different.)
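A minimal illustration of this point (assuming IEEE double-precision arithmetic as used by NumPy; not from the book): the squares of small nonzero elements underflow, so the computed dot product of a nonzero vector with itself is zero.

<pre>
import numpy as np

# A nonzero vector whose elements square to values far below the smallest
# representable double, so each product underflows to 0.
x_c = np.array([1e-200, 1e-200])

print(np.dot(x_c, x_c))      # 0.0 in IEEE double precision
print(np.any(x_c != 0.0))    # True: the vector itself is not zero
</pre>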

The elements of vectors and matrices are represented as ordinary numeric data in either fixed-point or floating-point representation. In the following, we will consider the floating-point representation and the computations in IF.

Storage Modes

The elements of an array are generally stored in a logically contiguous area of the computer’s memory. What is logically contiguous may not be physically contiguous, however.

Because accessing data from memory in a single pipeline may take more computer time than the computations themselves, computer memory may be organized into separate modules, or banks, with separate paths to the central processing unit. Logical memory is interleaved through the banks; that is, two consecutive logical memory locations are in separate banks. In order to take maximum advantage of the computing power, it may be necessary to be aware of how many interleaved banks the computer system has, but we will not consider such details here.

There are no convenient mappings of computer memory that would allow matrices to be stored in a logical rectangular grid, so matrices are usually stored either as columns strung end-to-end (a “column-major” storage) or as rows strung end-to-end (a “row-major” storage). Sometimes it is necessary to know which way the matrix is stored in the computer’s logical address space; that is, whether [math]\displaystyle{ a_{i,j} }[/math] is stored logically next to [math]\displaystyle{ a_{i+1,j} }[/math] or to [math]\displaystyle{ a_{i,j+1} }[/math]. (Physically, in the hardware, it may be next to neither of these.)
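A short sketch (assuming NumPy; not from the book) showing the two storage orders, where order='C' is row-major and order='F' is the Fortran-style column-major layout:

<pre>
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# Row-major ("C") storage strings the rows end-to-end;
# column-major ("F") storage strings the columns end-to-end.
print(A.flatten(order="C"))   # [1 2 3 4 5 6]
print(A.flatten(order="F"))   # [1 4 2 5 3 6]

# The same logical matrix, stored column-major in memory.
A_f = np.asfortranarray(A)
print(A_f.flags["F_CONTIGUOUS"])   # True
</pre>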

For some software to deal with matrices of varying sizes, the user must specify the length of one dimension of the array containing the matrix. (In general, the user must specify the lengths of all dimensions of the array except one.) In Fortran subroutines, it is common to have an argument specifying the leading dimension (number of rows), and in C functions it is common to have an argument specifying the column dimension. In an object-oriented system, this information is bundled in the object, and it is the object itself (the matrix, rather than a computer memory address) that is passed from one program module to another.

Chapter 6. Solution of nonlinear equations and optimization.

As we discussed in Section 1.8, most problems in statistical inference can be posed as optimization problems. An optimization problem is to solve [math]\displaystyle{ \arg\min_{x \in D} f(x) \quad\quad (6.1) }[/math] for x. The scalar-valued function f is called the objective function. The variable x, which is usually a vector, is called the decision variable, and its elements are called decision variables. The domain D of the decision variables is called the feasible set.
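A hedged sketch of solving a small unconstrained instance of problem (6.1) numerically; the objective function and the use of scipy.optimize.minimize are illustrative assumptions, not the book's method:

<pre>
import numpy as np
from scipy.optimize import minimize

# An illustrative objective function f over decision variables x in IR^2.
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

x0 = np.zeros(2)            # starting point
result = minimize(f, x0)    # arg min over x of f(x)

print(result.x)             # approximately [1, -2]
print(result.fun)           # objective value at the minimizer, ~0
</pre>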

In this chapter, after some preliminary general issues, we will consider different methods, which depend on the nature of D. First we consider problems in which D is continuous. The methods for continuous D often involve solving nonlinear equations, so we discuss techniques for solving equations in Section 6.1, before going on to the topic of optimization over continuous domains in Section 6.2. In Section 6.3 we consider the problem of optimization over a discrete domain. We mention a variety of methods, but consider only one of these, simulated annealing, in any detail. In Section 6.2 we assume that D = IR^m, that is, we assume that there are no constraints on the decision variables, and in Section 6.3 we likewise generally ignore any constraints imposed by D. In Section 6.4, we consider the necessary changes to accommodate constraints in D. In the final sections we consider some specific types of optimization problems that arise in statistical applications.

Categorizations of Optimization Problems

This basic problem (6.1) has many variations. A simple variation is the maximization problem, which is addressed by using −f(x). The general methods do not depend on whether we are minimizing or maximizing. We use the term “optimum” generally to mean either a minimum or a maximum, depending on how the problem is stated.

If D is essentially irrelevant (that is, if D includes all x for which f(x) is defined), the problem is called an unconstrained optimization problem. Otherwise, it is a constrained optimization problem.

Two distinct classes of problems can be associated with the cardinality of D. In one class, D is continuous, dense, and uncountable. In the other class, D is discrete and countable. The techniques for solving problems in these two classes are quite different, although techniques for one type can be used as approximations in the other type.

In this chapter, we will address all of these types of problems, but not with equal time. We will give most attention to the case of continuous D. The nature of f determines how the problem must be addressed, and this is especially true in the case of continuous D. If f is linear, the unconstrained problem is either trivial or ill-posed. If f is linear and the problem is constrained, the nature of D determines what kinds of algorithms are appropriate. If D is a convex polytope, for example, the problem is a linear program. The simplex method, which is a discrete stepping algorithm, is the most common way to solve linear programming problems. (This method was chosen as one of the Top 10 algorithms of the twentieth century; see page 138.) Linear programs arise in a limited number of statistical applications, for example, in linear regression fitting by minimizing either the L_1 or the L_∞ norm of the residuals. We will only briefly discuss linear programming in this chapter.

In the more general case of f over a continuous domain, the continuity of f is probably the most important issue. If f is arbitrarily discontinuous, there is very little that can be done in a systematic fashion. For general methods in continuous domains, we will limit our consideration to continuous objective functions. After continuity, the next obvious issue is the differentiability of f, and for some methods, we make assumptions about its differentiability. These assumptions are of two types: about existence, which affects theoretical properties of optimization methods, and, assuming they exist, about the cost of computation of derivatives, which affects the choice of method for solving the problem.

The other broad class of optimization problems consists of those in which D is discrete and countable. These are essentially problems in combinatorics. They must generally be attacked in very different ways from the approaches used in continuous domains.

Optimization problems over either dense or countable domains may have more than one solution; that is, there may be more than one point at which the minimum is attained. In the case of dense domains, the set of optimal points may even be uncountable. A more common case is one in which within certain regions of the domain, there is a local optimum, and within another region, there is another local optimum. We use the terms “global optimum” to refer to an optimum over the domain D. We use the term “local optimum” …

Chapter 7. Generation of random numbers.

Monte Carlo simulation is a core technology in computational statistics. Monte Carlo methods require numbers that appear to be realizations of random variables. Obtaining these numbers is the process called “generation of random numbers”.

Our objective is usually not to generate a truly random sample. Deep understanding of the generation process and strict reproducibility of any application involving the “random” numbers are more important. We often emphasize this perspective by using the word “pseudorandom”, although almost any time we use a phrase similar to “generation of random numbers”, we refer to “pseudorandom” numbers.

The quality of a process for random number generation is measured by the extent to which the sample generated appears, from every imaginable perspective, to be a random sample (that is, i.i.d.) from a given probability distribution. Some methods of random number generation are better than others.

7.1 Randomness of Pseudorandom Numbers

The initial step in random number generation is to obtain a sequence that appears to be independent realizations from a uniform distribution over the open interval (0, 1). We denote this distribution by U(0, 1).

While mathematically there is no difference between a continuous distribution over [0, 1] ⊂ IR and one over (0, 1) ⊂ IR, there is a difference between a distribution over [0, 1] ⊂ IF and one over (0, 1) ⊂ IF. Because of the computations we may perform with samples that appear to be from U(0, 1), we must make sure that we exclude the zero-probability events of 0 and 1. (See Exercise 2.9a and its solution on page 677 for a different situation, which may superficially appear to be the same as this.)
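One simple way to guarantee samples strictly inside the open interval (0, 1) is to map random integers in {1, ..., 2^53 − 1} onto that interval; this is an illustrative sketch (assuming NumPy's generator), not the book's generator:

<pre>
import numpy as np

rng = np.random.default_rng(seed=42)

# Integers in [1, 2**53 - 1], divided by 2**53, lie strictly in (0, 1);
# the endpoints 0 and 1 cannot occur.
u = rng.integers(1, 2**53, size=5) / 2.0**53
print(u)
print((u > 0.0).all() and (u < 1.0).all())   # True
</pre>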

Introduction to Part III - Methods of Computational Statistics

The field of computational statistics includes a set of statistical methods that are computationally intensive. These methods may involve looking at data from many different perspectives and looking at various subsets of the data. Even for moderately sized datasets, the multiple analyses may result in a large number of computations. Statistical methods may be computationally intensive also because the dataset is extremely large. With the ability to collect data automatically, ever-larger datasets are available for analysis.

Viewing data from various perspectives often involves transformations such as projections onto multiple lower-dimensional spaces. Interesting datasets may consist of subsets that are different in some important way from other subsets of the given data. The identification of different subsets and the properties that distinguish them is computationally intensive because of the large number of possible combinations.

Another type of computationally intensive method useful in a wide range of applications involves simulation of the data-generating process. Study of many sets of artificially generated data helps to understand the process that generates real data. This is an exciting method of computational statistics because of the inherent possibilities of unexpected discoveries through experimentation.

Monte Carlo experimentation is the use of simulated random numbers to estimate some functional of a probability distribution. In simple applications of Monte Carlo, a problem that does not naturally have a stochastic component may be posed as a problem with a component that can be identified with an expectation of some function of a stochastic variable. The problem is then solved by estimating the expected value by use of a simulated sample from the distribution of a random variable. In such applications, Monte Carlo methods are similar to other methods of numerical analysis.
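A minimal sketch of this idea (the integrand and the use of NumPy are illustrative assumptions): a definite integral is posed as an expectation with respect to U(0, 1) and estimated from a simulated sample.

<pre>
import numpy as np

rng = np.random.default_rng(seed=0)

# Estimate the integral of exp(-x**2) over (0, 1) by writing it as
# E[g(U)] with U ~ U(0, 1) and g(u) = exp(-u**2).
n = 100_000
u = rng.random(n)
estimate = np.mean(np.exp(-u**2))

print(estimate)   # close to the true value, about 0.7468
</pre>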

Monte Carlo methods differ from other methods of numerical analysis, however, in yielding an estimate rather than an approximation. The “numerical error” in a Monte Carlo estimate is due to a pseudovariance associated with a pseudorandom variable; but the numerical error in standard numerical analysis is associated with approximations, including discretization, truncation, and roundoff. Monte Carlo methods can also be used to make inferences about parameters of models and to study random processes. In statistical inference, real data are used to estimate parameters of models and to study random processes assumed to have generated the data. Some of the statistical methods discussed in Part III use simulated data in the analysis of real data. There are several ways this can be done.

If the simulated data are used just to estimate one or more parameters, rather than to study the probability model more generally, we generally use the term Monte Carlo to refer to the method. Whenever simulated data are used in the broader problem of studying the complete process and building models, the method is often called simulation. This distinction between a simulation method and a Monte Carlo method is by no means universally employed; and we will sometimes use the terms “simulation” and “Monte Carlo” synonymously.

In either simulation or Monte Carlo, an actual dataset may be available; but it may be supplemented with artificially generated data. The term “resampling” is related to both “simulation” and “Monte Carlo”, and some authors use it synonymously with one or both of the other terms. In this text, we generally use the term “resampling” to refer to a method in which random subsamples are generated from a given dataset; that is, there is no additional artificially generated data.

In the chapters in Part III, we discuss the general methods of computational statistics. These include:

  • graphical methods;
  • projection and other methods of transforming data and approximating functions;
  • Monte Carlo methods and simulation;
  • randomization and use of subsets of the data;
  • bootstrap methods.

Some of the chapters in Part III have close correspondence to chapters in Part II. The methods of Chapter 9 rely heavily on Chapter 5; those of Chapter 10, on Chapter 4; and those of Chapter 11, on Chapter 7. Of course, the basic methods of random number generation (Chapter 7) underlie many of the methods of computational statistics.

Chapter 8. Graphical methods in computational statistics.

One of the first steps in attempting to understand data is to visualize it. Visualization of data and information provides a wealth of tools that can be used in detecting features, in discovering relationships, and finally in retaining the knowledge gained.

Graphical displays have always been an important part of statistical data analysis, but with the continuing developments in high-speed computers and high-resolution devices, the usefulness of graphics has greatly increased. Higher resolution makes for a more visually pleasing display, and occasionally it allows features to be seen that could not be distinguished otherwise. The most important effects of the computer on graphical methods in statistics, however, arise from the ease and speed with which graphical displays can be produced, rather than from the resolution.

Rapid production of graphical displays has introduced motion and articulated projections and sections into statistical graphics. Such graphical methods are important tools of computational statistics. The multiple views are tools of discovery, not just ways of displaying a set of data that has already been analyzed. Although the power of graphical displays has greatly increased, some of the most useful graphs are the simple ones, as illustrated in Section 1.1, and they should not be ignored just because we can do more impressive things.

Proper design of a graphical display depends on the context of the application and the purpose of the graphics, whether it is for the analyst to get a better understanding of the data or to present a picture that conveys a message. Our emphasis in the following discussion is on methods useful in exploratory graphics.

One thing that is lagging in the statistical literature is the use of color in graphical displays. The simple mechanics used in producing popular magazines are yet to be incorporated in the production of learned journals. Journals available in electronic form do not have these production problems, and it is likely that the paper versions of many journals will be discontinued before the production of color is mastered.

Chapter 9. Tools for identification of structure in data.

In recent years, with our increased ability to collect and store data, have come enormous datasets. These datasets may consist of billions of observations and millions of variables. Some of the classical methods of statistical inference, in which a parametric model is studied, are neither feasible nor relevant for analysis of these datasets. The objective is to identify interesting structures in the data, such as clusters of observations, or relationships among the variables. Sometimes, the structures allow a reduction in the dimensionality of the data. Many of the classical methods of multivariate analysis, such as principal components analysis, factor analysis, canonical correlations analysis, and multidimensional scaling, are useful in identifying interesting structures. These methods generally attempt to combine variables in such a way as to preserve information yet reduce the dimension of the dataset. Dimension reduction generally carries a loss of some information. Whether the lost information is important is the major concern in dimension reduction.

Another set of methods for reducing the complexity of a dataset attempts to group observations together, combining observations, as it were.

In the following we will assume that an observation consists of a vector [math]\displaystyle{ x = (x_1,...,x_m) }[/math]. In most cases, we will assume that [math]\displaystyle{ x \in IR^m }[/math]. In statistical analysis, we generally assume that we have n observations, and we use X to denote an n × m matrix in which the rows correspond to observations.

In practice, it is common for one or more of the components of x to be measured on a nominal scale; that is, one or more of the variables represents membership in some particular group within a countable class of groups. We refer to such variables as “categorical variables”. Although sometimes it is important to make finer distinctions among types of variables (see Stevens, 1946, who identified nominal, ordinal, interval, and ratio types), we often need to make a simple distinction between variables whose values can be modeled by IR and those whose values essentially indicate membership in a particular group. We may represent the observation x as being composed of these two types, “real” or “numerical”, and “categorical”:

[math]\displaystyle{ x = (x_r, x_c). }[/math]

In the following, we often use the phrase “numerical data” to indicate that each element of the vector variable takes on values in IR, that the relevant operations of IR are available, and that the properties of the reals apply.

Major concerns for methods of identifying structure are the number of computations and amount of storage required.

In this chapter, we introduce some of the tools that are used for identifying structure in data. There are two distinct tools: transformations of data, and internal measures of structure. Although these two topics are to some extent independent, transformations may change the internal measures or may help us to use them more effectively. Transformations also play important roles in exploration of data, as in the graphical displays discussed in Chapter 8. In Chapter 16, using the tools covered in this chapter, we discuss various methods of exploring data.

Linear Structure and Other Geometric Properties

Numerical data can conveniently be represented as geometric vectors. We can speak of the length of a vector, or of the angle between two vectors, and relate these geometric characteristics to properties of the data. We will begin with definitions of a few basic terms.

The Euclidean length, or just the length, of an n-vector x is the square root of the sum of the squares of the elements of the vector. We generally denote the Euclidean length of x as [math]\displaystyle{ \|x\|_2 }[/math] or just as [math]\displaystyle{ \|x\| }[/math]: [math]\displaystyle{ \|x\| = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}. }[/math]

The Euclidean length is a special case of a more general real-valued function of a vector called a “norm”, which is defined on page 13.

The angle θ between the vectors x and y is defined in terms of the cosine by [math]\displaystyle{ \cos(\theta) = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}. }[/math] (See Figure 9.1.)
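A small sketch (assuming NumPy; the vectors are arbitrary) computing the Euclidean length of a vector and the angle between two vectors directly from these definitions:

<pre>
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

length_x = np.sqrt(np.sum(x**2))     # Euclidean length ||x||, here 5.0
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)         # angle between x and y, in radians

print(length_x, theta)
</pre>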

Linear structures in data are the simplest and often the most interesting. Linear relationships can also be used to approximate other more complicated structures.

Flats

The set of points x whose components satisfy a linear equation …

Chapter 10. Estimation of functions.

An interesting problem in statistics, and one that is generally difficult, is the estimation of a continuous function, such as a probability density function or a nonlinear regression model. The statistical properties of an estimator of a function are more complicated than statistical properties of an estimator of a single parameter or of a countable set of parameters.

In Chapter 4 we discussed ways of numerically approximating functions. In this brief chapter we will discuss ways of statistically estimating functions. Many of these methods are based on approximation methods such as orthogonal systems, splines, and kernels discussed in Chapter 4. The PDF decomposition plays an important role in the estimation of functions.

We will discuss the properties of an estimator in the general case of a real scalar-valued function over real vector-valued arguments (that is, a mapping from [math]\displaystyle{ IR^d }[/math] into IR). One of the most common situations in which these properties are relevant is in nonparametric probability density estimation, which we discuss in Chapter 15. (In that application, of course, we do not have to do a PDF decomposition.) The global statistical properties we discuss in Section 10.3 are the measures by which we evaluate probability density estimators.

First, we say a few words about notation. We may denote a function by a single letter, f, for example, or by the function notation, f(·) or f(x). When f(x) denotes a function, x is merely a placeholder. The notation f(x), however, may also refer to the value of the function at the point x. The meaning is usually clear from the context.

Using the common “hat” notation for an estimator, we use [math]\displaystyle{ \hat{f} }[/math] or [math]\displaystyle{ \hat{f}(x) }[/math] to denote the estimator of f or of f(x). Following the usual terminology, we use the term “estimator” to denote a random variable, and “estimate” to denote a realization of the random variable.

The hat notation is also used to denote an estimate, so we must determine from the context whether [math]\displaystyle{ \hat{f} }[/math] or [math]\displaystyle{ \hat{f}(x) }[/math] denotes a random variable or a realization of a random variable.

The estimate or the estimator of the value of the function at the point x may also be denoted by [math]\displaystyle{ \hat{f}(x) }[/math]. Sometimes, to emphasize that we are estimating the ordinate of the function rather than evaluating an estimate of the function, we use the notation [math]\displaystyle{ \widehat{f(x)} }[/math]. In this case also, we often make no distinction in the notation between the realization (the estimate) and the random variable (the estimator). We must determine from the context whether [math]\displaystyle{ \hat{f}(x) }[/math] or [math]\displaystyle{ \widehat{f(x)} }[/math] denotes a random variable or a realization of a random variable. In most of the following discussion, however, the hat notation denotes a random variable. Its distribution depends on the underlying random variable that yields the sample from which the estimator is computed.

The usual optimality properties that we use in developing a theory of estimation of a finite-dimensional parameter must be extended for estimation of a general function. As we will see, two of the usual desirable properties of point estimators, namely unbiasedness and maximum likelihood, cannot be attained globally or in general by estimators of functions.

There are many similarities in estimation of functions and approximation of functions, but we must be aware of the fundamental differences in the two problems. Estimation of functions is similar to other estimation problems: We are given a sample of observations; we make certain assumptions about the probability distribution of the sample; and then we develop estimators. The estimators are random variables, and how useful they are depends on properties of their distribution, such as their expected values and their variances.

Approximation of functions is an important aspect of numerical analysis. Functions are often approximated to interpolate functional values between directly computed or known values. Functions are also approximated as a prelude to quadrature. In this chapter, we will often approximate a function as a step in the statistical estimation of the function.

In the problem of function estimation, we may have observations on the function at specific points in the domain, or we may have indirect measurements of the function, such as observations that relate to a derivative or an integral of the function. In either case, the problem of function estimation has the competing goals of providing a good fit to the observed data and predicting values at other points. In many cases, a smooth estimate satisfies this latter objective. In other cases, however, the unknown function itself is not smooth. Functions with different forms may govern the phenomena in different regimes. This presents a very difficult problem in function estimation, but it is one that we will not consider in any detail here.

There are various approaches to estimating functions. Maximum likelihood (see page 70) has limited usefulness for estimating functions because in general the likelihood is unbounded. A practical approach is to assume that the function is of a particular form and estimate the parameters that characterize the form. For example, we may assume that the function is exponential, possibly because of physical properties such as exponential decay. We may then use various estimation criteria, such as least squares, to estimate the parameter. An extension of this approach is to assume that the function is a mixture of other functions. The mixture can be formed by different functions over different domains or by weighted averages of the functions over the whole domain. Estimation of the function of interest involves estimation of various parameters as well as the weights.

Another approach to function estimation is to represent the function of interest as a linear combination of basis functions, that is, to represent the function in a series expansion. The basis functions are generally chosen to be orthogonal over the domain of interest, and the observed data are used to estimate the coefficients in the series. We discuss the use of basis functions beginning on page 18 and again on page 161.

It is often more practical to estimate the function value at a given point. (Of course, if we can estimate the function at any given point, we can effectively have an estimate at all points.) One way of forming an estimate of a function at a given point is to take the average at that point of a filtering function that is evaluated in the vicinity of each data point. The filtering function is called a kernel, and the result of this approach is called a kernel estimator. We discussed use of kernels in approximation of functions in Section 4.5. Kernel methods have limited use in function approximation, but they are very useful in function estimation. We briefly discuss the use of kernel filters in function estimation on page 406, but we discuss those kinds of methods more fully in the context of probability density function estimation in Section 15.3, beginning on page 499.

In the estimation of functions, we must be concerned about the properties of the estimators at specific points and also about properties over the full domain. Global properties over the full domain are often defined in terms of integrals or in terms of suprema or infima.

10.1 General Approaches to Function Estimation

The theory of statistical estimation is based on probability distributions. In order to develop an estimation procedure, we need to identify random variables, and make some assumptions about their distributions. In the case of statistical estimation of a function, this may involve decomposing the function of interest so as to have a factor that is a PDF. This PDF decomposition is a preliminary step in function estimation. Once a random variable and a probability distribution are identified, methods for estimation of functions often parallel the methods of approximation of functions as in Chapter 4.

Function Decomposition and Estimation of the Coefficients in an Orthogonal Expansion

In the following, we will work with functions that are square-integrable over some domain D; that is, functions in [math]\displaystyle{ L^2(D) }[/math].

10.2 Pointwise Properties of Function Estimators

The statistical properties of an estimator of a function at a given point are analogous to the usual statistical properties of an estimator of a scalar parameter. The statistical properties involve expectations or other properties of random variables. In the following, when we write an expectation, E(·), or a variance, V(·), the expectations are usually taken with respect to the (unknown) distribution of the underlying random variable. Occasionally, we may explicitly indicate the distribution by writing, for example, Ep(·), where p is the density of the random variable with respect to which the expectation is taken.

Bias

The bias of the estimator of a function value at the point x is

[math]\displaystyle{ \mathrm{E}\big(\hat{f}(x)\big) - f(x). }[/math]

If this bias is zero, we would say that the estimator is unbiased at the point x. If the estimator is unbiased at every point x in the domain of f, we say that the estimator is pointwise unbiased. Obviously, in order for [math]\displaystyle{ \hat{f}(\cdot) }[/math] to be pointwise unbiased, it must be defined over the full domain of f.

Variance

The variance of the estimator at the point x is

[math]\displaystyle{ \mathrm{V}\big(\hat{f}(x)\big) = \mathrm{E}\Big(\big(\hat{f}(x) - \mathrm{E}\big(\hat{f}(x)\big)\big)^2\Big). }[/math]

Estimators with small variance are generally more desirable, and an optimal estimator is often taken as the one with smallest variance among a class of unbiased estimators.

Mean Squared Error

The mean squared error, MSE, at the point x is

[math]\displaystyle{ \mathrm{MSE}\big(\hat{f}(x)\big) = \mathrm{E}\Big(\big(\hat{f}(x) - f(x)\big)^2\Big). \quad\quad (10.9) }[/math]

The mean squared error is the sum of the variance and the square of the bias:

[math]\displaystyle{ \mathrm{MSE}\big(\hat{f}(x)\big) = \mathrm{E}\Big(\big(\hat{f}(x)\big)^2 - 2\hat{f}(x)f(x) + (f(x))^2\Big) = \mathrm{V}\big(\hat{f}(x)\big) + \Big(\mathrm{E}\big(\hat{f}(x)\big) - f(x)\Big)^2. \quad\quad (10.10) }[/math]

Sometimes, the variance of an unbiased estimator is much greater than that of an estimator that is only slightly biased, so it is often appropriate to compare the mean squared error of the two estimators. In some cases, as we will see, unbiased estimators do not exist, so rather than seek an unbiased estimator with a small variance, we seek an estimator with a small MSE.
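As a hedged illustration of decomposition (10.10) (the toy estimator and data-generating process are assumptions made purely for the example), a simulation shows the empirical MSE at a point agreeing with the empirical variance plus squared bias:

<pre>
import numpy as np

rng = np.random.default_rng(seed=1)

f_x = 2.0            # the true (in practice unknown) value f(x) at the point x
n, reps = 25, 20_000

# A toy estimator of f(x): a shrunken sample mean, which is biased.
samples = rng.normal(loc=f_x, scale=1.0, size=(reps, n))
f_hat = 0.9 * samples.mean(axis=1)

bias = f_hat.mean() - f_x
variance = f_hat.var()
mse = np.mean((f_hat - f_x) ** 2)

print(mse, variance + bias**2)   # the two numbers agree closely
</pre>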

Mean Absolute Error

The mean absolute error, MAE, at the point x is similar to the MSE: [math]\displaystyle{ \mathrm{MAE}\big(\hat{f}(x)\big) = \mathrm{E}\Big(\big|\hat{f}(x) - f(x)\big|\Big). \quad\quad (10.11) }[/math] It is more difficult to do mathematical analysis of the MAE than it is for the MSE. Furthermore, the MAE does not have a simple decomposition into other meaningful quantities similar to the MSE.

Consistency

Consistency of an estimator refers to the convergence of the expected value of the estimator to what is being estimated as the sample size increases without bound. A point estimator [math]\displaystyle{ T_n }[/math], based on a sample of size n, is consistent for θ if …

Chapter 11. Monte Carlo methods for statistical inference.

Monte Carlo methods are experiments. Monte Carlo experimentation is the use of simulated random numbers to estimate some functional of a probability distribution. A problem that does not have a stochastic component can sometimes be posed as a problem with a component that can be identified with an expectation of some function of a random variable. This is often done by means of a PDF decomposition. The problem is then solved by estimating the expected value by use of a simulated sample from the distribution of the random variable.

Monte Carlo methods use random numbers, so to implement a Monte Carlo method it is necessary to have a source of random numbers. On the computer, we generally settle for pseudorandom numbers, that is, numbers that appear to be random but are actually deterministic. Generation of pseudorandom numbers is the topic of Chapter 7.

Often, our objective is not to simulate random sampling directly, but rather to estimate a specific quantity related to the distribution of a given sample. In this case, we may want to ensure that a chosen sample closely reflects the distribution of the population we are simulating. Because of random variation, a truly random sample or a pseudorandom sample that simulates a random sample would not necessarily have this property. Sometimes, therefore, we generate a quasirandom sample, which is a sample constrained to reflect closely the distribution of the population we are simulating, rather than to exhibit the variability that would result from random sampling. Because in either case we proceed to treat the samples as if they were random, we will refer to both pseudorandom numbers and quasirandom numbers as “random numbers”, except when we wish to emphasize the “pseudo” or “quasi” nature.

In this chapter, we discuss various ways random numbers are used in statistical inference. Monte Carlo methods are also used in many of the techniques described in other chapters.

Chapter 12. Data randomization, partitioning, and augmentation.

Although subsampling, resampling, or otherwise rearranging a given dataset cannot increase its information content, these procedures can sometimes be useful in extracting information. Randomly rearranging the observed dataset, for example, can give an indication of how unusual the dataset is with respect to a given null hypothesis. This idea leads to randomization tests.

There are many useful procedures for data analysis that involve partitioning the original sample. Using subsets of the full sample, we may be able to get an estimate of the bias or the variance of the standard estimator or test statistic without relying too heavily on the assumptions that led to that choice of estimator or test statistic. It is often useful to partition the dataset into two parts and use the data in the “training set” or “estimation set” to arrive at a preliminary estimate or fit and then use the data in the “validation set” or “test set” to evaluate the fit. This kind of approach is particularly appropriate when we are unsure of our model of the data-generating process. In actual applications, of course, we are always at least somewhat unsure of our model. If the full dataset is used to fit the model, we are very limited in the extent to which we can validate the model.
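A minimal sketch (assuming NumPy; the dataset is simulated purely for illustration) of randomly partitioning a dataset into a training set and a validation set:

<pre>
import numpy as np

rng = np.random.default_rng(seed=3)

n = 100
data = rng.normal(size=(n, 4))         # an illustrative dataset

perm = rng.permutation(n)              # random reordering of row indices
n_train = int(0.7 * n)
train, validation = data[perm[:n_train]], data[perm[n_train:]]

print(train.shape, validation.shape)   # (70, 4) (30, 4)
</pre>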

No matter what we do with a given dataset, we are still left with uncertainty about the relevance of that dataset to future modeling problems. Prediction requires some assumption about the model of the data-generating processes, both the one that yielded the given data and the unseen one for which inferences are to be made. The variance of predictors is called prediction error or generalization error. Obviously, since this involves unseen data and unknown scenarios, there is no way of measuring or estimating this kind of error with any confidence. The use of partitions of the given dataset, however, is one way of getting some feel for the generalization error for scenarios that are somewhat similar to the one that gave rise to the observed dataset.

Subsets of the data can be formed systematically or they can be formed as random samples from the given dataset. Sometimes the given dataset is viewed as a set of mass points of a finite distribution whose distribution function is the same as the empirical distribution function of the given dataset. In this case, the data partitioning may be done in such a way that observations may occur multiple times in the various partitions. In most cases, when we speak of “sets” or “subsets”, we mean “multisets” (that is, collections in which items may occur more than once and are distinguished in each occurrence).

12.1 Randomization Methods

Chapter 13. Bootstrap methods.

Resampling methods involve the use of many samples, each taken from a single sample that was taken from the population of interest. Inference based on resampling makes use of the conditional sampling distribution of a new sample (the “resample”) drawn from a given sample. Statistical functions on the given sample, a finite set, can easily be evaluated. Resampling methods therefore can be useful even when very little is known about the underlying distribution.

A basic idea in bootstrap resampling is that, because the observed sample contains all the available information about the underlying population, the observed sample can be considered to be the population; hence, the distribution of any relevant test statistic can be simulated by using random samples from the “population” consisting of the original sample. Suppose that a sample [math]\displaystyle{ y_1, \ldots, y_n }[/math] is to be used to estimate a population parameter, θ. For a statistic T that estimates θ, as usual, we wish to know the sampling distribution so as to correct for any bias in our estimator or to set confidence intervals for our estimate of θ. The sampling distribution of T is often intractable in applications of interest.

A basic bootstrapping method formulated by Efron (1979) uses the discrete distribution represented by the sample to study the unknown distribution from which the sample came. The basic tool is the empirical cumulative distribution function. The ECDF is the CDF of the finite population that is used as a model of the underlying population of interest.

The functional of the CDF that defines a parameter defines a plug-in estimator of that parameter when the functional is applied to the ECDF. A functional of a population distribution function, Θ(P), defining a parameter θ can usually be expressed as [math]\displaystyle{ \theta = \Theta(P) = \int g(y)\, \mathrm{d}P(y). \quad\quad (13.1) }[/math] The plug-in estimator T is the same functional of the ECDF: [math]\displaystyle{ T = T(P_n) = \Theta(P_n) = \int g(y)\, \mathrm{d}P_n(y). \quad\quad (13.2) }[/math] (In both of these expressions, we are using the integral in a general sense. In the second expression, the integral is a finite sum. It is also a countable sum in the first expression if the random variable is discrete. Note also that we use the same symbol to denote the functional and the random variable.) Various properties of the distribution of T can be estimated by use of “bootstrap samples”, each of the form [math]\displaystyle{ \{y^*_1, \ldots, y^*_n\} }[/math], where the [math]\displaystyle{ y^*_i }[/math]’s are chosen from the original [math]\displaystyle{ y_i }[/math]’s with replacement.
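A hedged sketch of the basic idea (the data, the statistic, and the number of bootstrap samples are illustrative assumptions, not the book's example): bootstrap samples are drawn from the observed data with replacement, and the replications are used to estimate the sampling variability of the statistic, here the sample median.

<pre>
import numpy as np

rng = np.random.default_rng(seed=7)

y = rng.gamma(shape=2.0, scale=1.5, size=50)   # an illustrative observed sample
n = len(y)
B = 2_000                                      # number of bootstrap samples

# Each bootstrap sample draws n values from y with replacement.
boot_medians = np.array([
    np.median(rng.choice(y, size=n, replace=True)) for _ in range(B)
])

print(np.median(y))          # the original estimate T
print(boot_medians.std())    # bootstrap estimate of the standard error of T
</pre>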

We define a resampling vector, [math]\displaystyle{ p^* }[/math], corresponding to each bootstrap sample as the vector of proportions of the elements of the original sample in the given bootstrap sample. The resampling vector is a realization of a random vector [math]\displaystyle{ P^* }[/math] for which [math]\displaystyle{ nP^* }[/math] has an n-variate multinomial distribution with parameters n and (1/n, . . . , 1/n). The resampling vector has random components that sum to 1. For example, if the bootstrap sample [math]\displaystyle{ (y^*_1, y^*_2, y^*_3, y^*_4) }[/math] happens to be the sample [math]\displaystyle{ (y_2, y_2, y_4, y_3) }[/math], the resampling vector [math]\displaystyle{ p^* }[/math] is (0, 1/2, 1/4, 1/4).

The bootstrap replication of the estimator T is a function of [math]\displaystyle{ p^* }[/math], [math]\displaystyle{ T(p^*) }[/math]. The resampling vector can be used to estimate the variance of the bootstrap estimator. By imposing constraints on the resampling vector, the variance of the bootstrap estimator can be reduced.

The bootstrap principle involves repeating the process that leads from a population CDF to an ECDF. Taking the ECDF [math]\displaystyle{ P_n }[/math] to be the CDF of a population, and resampling, we have an ECDF for the new sample, [math]\displaystyle{ P_n^{(1)} }[/math]. (In this notation, we could write the ECDF of the original sample as [math]\displaystyle{ P_n^{(0)} }[/math].) The difference is that we know more about [math]\displaystyle{ P_n^{(1)} }[/math] than we know about [math]\displaystyle{ P_n }[/math]. Our knowledge about [math]\displaystyle{ P_n^{(1)} }[/math] comes from the simple discrete uniform distribution, whereas our knowledge about [math]\displaystyle{ P_n }[/math] depends on knowledge (or assumptions) about the underlying population.

The bootstrap resampling approach can be used to derive properties of statistics, regardless of whether any resampling is done. Most common uses of the bootstrap involve computer simulation of the resampling; hence, bootstrap methods are usually instances of computational inference.

13.1 Bootstrap Bias Corrections

For an estimator T that is the same functional of the ECDF as the parameter is of the CDF, the problem of bias correction is to find a functional [math]\displaystyle{ f_T }[/math] that …

Part IV - Exploring Data Density and Relationships

A canonical problem in statistics is to gain understanding of a given random sample, [math]\displaystyle{ y_1,..., y_n, }[/math] in order to understand better the data-generating process that yielded the data. The specific objective is to make inferences about the population from which the random sample arose. In many cases, we wish to make inferences only about some finite set of parameters, such as the mean and variance, that describe the population. In other cases, we want to predict a future value of an observation. Sometimes, the objective is more difficult: We want to estimate a function that characterizes the distribution of the population. The cumulative distribution function (CDF) or the probability density function (PDF) provides a complete description of the population, so we may wish to estimate these functions.

In the simpler cases of statistical inference, we assume that the form of the CDF P is known and that there is a parameter, θ = Θ(P), of finite dimension that characterizes the distribution within that assumed family of forms. An objective in such cases may be to determine an estimate [math]\displaystyle{ \hat{\theta} }[/math] of the parameter θ. The parameter may completely characterize the probability distribution of the population, or it may just determine an important property of the distribution, such as its mean or median. If the distribution or density function is assumed to be known up to a vector of parameters, the complete description is provided by the parameter estimate. For example, if the distribution is assumed to be normal, then the form of P is known. It involves two parameters, the mean μ and the variance [math]\displaystyle{ \sigma^2 }[/math]. The problem of completely describing the distribution is merely the problem of estimating [math]\displaystyle{ \theta = (\mu, \sigma^2) }[/math]. In this case, the estimates of the CDF, [math]\displaystyle{ \hat{P} }[/math], and the density, [math]\displaystyle{ \hat{p} }[/math], are the normal CDF and density with the estimate of the parameter, [math]\displaystyle{ \hat{\theta} }[/math], plugged in.

If no assumptions, or only weak assumptions, are made about the form of the distribution or density function, the estimation problem is much more difficult. Because the distribution function or density function is a characterization from which all other properties of the distribution could be determined, we expect the estimation of the function to be the most difficult type of statistical inference. “Most difficult” is clearly a heuristic concept and here may mean that the estimator is most biased, most variable, most difficult to compute, most mathematically intractable, and so on.

Estimators such as [math]\displaystyle{ \hat{\theta} }[/math] for the parameter θ or [math]\displaystyle{ \hat{p} }[/math] for the density p are usually random variables; hence, we are interested in the statistical properties of these estimators. If our approach to the problem treats θ and p as fixed (but unknown), then the distribution of [math]\displaystyle{ \hat{\theta} }[/math] and [math]\displaystyle{ \hat{p} }[/math] can be used to make informative statements about θ and p. Alternatively, if θ and p are viewed as realizations of random variables, then the distribution of [math]\displaystyle{ \hat{\theta} }[/math] and [math]\displaystyle{ \hat{p} }[/math] can be used to make informative statements about conditional distributions of the parameter and the function, given the observed data.

Although the CDF in some ways is more fundamental in characterizing a probability distribution (it always exists and is defined the same for both continuous and discrete distributions), the probability density function is more familiar to most data analysts. Important properties such as skewness, modes, and so on can be seen more readily from a plot of the probability density function than from a plot of the CDF. We are therefore usually more interested in estimating the density, p, than the CDF, P. Some methods of estimating the density, however, are based on estimates of the CDF. The simplest estimate of the CDF is the empirical cumulative distribution function, the ECDF, which is defined as [math]\displaystyle{ P_n(y) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, y]}(y_i). }[/math] (See page 669 for the definition and properties of the indicator function [math]\displaystyle{ I_S(\cdot) }[/math] in the ECDF.) As we have seen on page 59, the ECDF is pointwise unbiased for the CDF.
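A short sketch (assuming NumPy; the sample values are arbitrary) evaluating the ECDF at a point directly from its definition as the proportion of observations less than or equal to y:

<pre>
import numpy as np

def ecdf(y_sample, y):
    """Empirical CDF P_n(y): the fraction of observations <= y."""
    y_sample = np.asarray(y_sample)
    return np.mean(y_sample <= y)

y_sample = [2.1, 0.4, 3.3, 1.7, 2.9]
print(ecdf(y_sample, 2.0))   # 0.4: two of the five observations are <= 2.0
</pre>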

The derivative of the ECDF, the empirical probability density function (EPDF), [math]\displaystyle{ p_n(y) = \frac{1}{n} \sum_{i=1}^{n} \delta(y - y_i), }[/math] where [math]\displaystyle{ \delta }[/math] is the Dirac delta function, is just a series of spikes at points corresponding to the observed values. It is not very useful as an estimator of the probability density. It is, however, unbiased for the probability density function at any point.

In the absence of assumptions about the form of the density p, the estimation problem may be computationally intensive. A very large sample is usually required in order to get a reliable estimate of the density. The goodness of the estimate depends on the dimension of the random variable. Heuristically, the higher the dimension, the larger the sample required to provide adequate representation of the sample space.

Density estimation generally has more modest goals than the development of a mathematical expression that yields the probability density function p everywhere. Although we may develop such an expression, the purpose of the estimate is usually a more general understanding of the population:

  • to identify structure in the population, its modality, tail behavior, skewness, and so on;
  • to classify the data and to identify different subpopulations giving rise to it; or
  • to make a visual presentation that represents the population density.

There are several ways to approach the probability density estimation problem. In a parametric approach mentioned above, the parametric family of distributions, such as a normal distribution or a beta distribution, is assumed. The density is estimated by estimating the parameters of the distribution and substituting the estimates into the expression for the density. In a nonparametric approach, only very general assumptions about the distribution are made. These assumptions may only address the shape of the distribution, such as an assumption of unimodality or an assumption of continuity or other degrees of smoothness of the density function. There are various semiparametric approaches in which, for example, parametric assumptions may be made only over a subset of the range of the distribution, or, in a multivariate case, a parametric approach may be taken for some elements of the random vector and a nonparametric approach for others. Another approach is to assume a more general family of distributions, perhaps characterized by a differential equation, for example, and to fit the equation by equating properties of the sample, such as sample moments, with the corresponding properties of the equation.

In the case of parametric estimation, we have a complete estimate of the density (that is, an estimate at all points). In nonparametric estimation, we generally develop estimates of the ordinate of the density function at specific points. After the estimates are available at given points, a smooth function can be fitted.

Chapters 14 and 15 in this part address the problem of estimating a probability density function, either parametrically or nonparametrically. The probability density of a data-generating process is one of the most important determinants of the structure in data, and it is by studying some aspects of that structure that we make inferences about the probability density.

Chapter 16 considers the problems of finding other structure in the data. Many of the methods we discuss are sensitive to artificial structure, which, similar to artificial ill-conditioning that we discussed on page 208, is structure that can be removed by univariately scaling the data. Scaling has implications not only for numerical computations; it also affects the results of many multivariate analyses, even if the computations are exact.

It is now common to search through datasets and compute summary statistics from various items that may indicate relationships that were not previously recognized. The individual items or the relationships among them may not have been of primary interest when the data were originally collected. This process of prowling through large datasets is sometimes called data mining or knowledge discovery in databases (KDD). (The names come and go with current fads; there is very little of substance indicated by use of different names. The meaning of “large” in the phrase “large datasets” becomes ever more restrictive as the capacity of computer systems grows.) The objective is to discover characteristics of the data that may not be expected based on the existing theory. In the language of the database literature, the specific goals of data mining are:

  • classification of observations;
  • linkage analysis;
  • deviation detection;

and finally

  • predictive modeling.

Of course, the first three of these are the objectives of any exploratory statistical data analysis. Data mining is exploratory data analysis (EDA) applied to large datasets. An objective of an exploratory analysis is often to generate hypotheses, and exploratory analyses are generally followed by more formal confirmatory procedures. The explorations in massive datasets must be performed without much human intervention. Searching algorithms need to have some means of learning and adaptively improving. This will be a major area of research for some time.

Predictive modeling uses inductive reasoning rather than the more common deductive reasoning, which is much easier to automate.

In the statistical classification of observations, the dataset is partitioned recursively. The partitioning results in a classification tree, which is a decision tree, each node of which represents a partition of the dataset. The decision at each node is generally based on the values of a single variable at a time.

The partitioning can also be based on linear combinations of the variables. This is sometimes called “oblique partitioning” because the partitions are not parallel to the axes representing the individual variables. Seeking good linear combinations of variables on which to build oblique partitions is a much more computationally intensive procedure than just using single variables.

Linkage analysis is often the most important activity of data mining. In linkage analysis, relationships among different variables are discovered and analyzed. This step follows partitioning and is the interpretation of the partitions that were formed.

It is also important to identify data that do not fit the patterns that are discovered. The deviation of some subsets of the data often makes it difficult to develop models for the remainder of the data.

In Chapter 17 we consider building models that express asymmetric relationships between variables and then making inferences about those models.

Chapter 14. Estimation of probability density functions using parametric models.

The problem addressed in this chapter is the estimation of an unknown probability density p(y). The way of doing this that we discuss in this chapter is to approximate the unknown distribution by use of familiar parametric functions, and then to estimate the parameters in these functions. While mechanically this is essentially point estimation, there are differences in the objectives. We are more concerned about the estimator of the function, [math]\displaystyle{ \hat{p} }[/math], or the value of the function at the point y, [math]\displaystyle{ \hat{p}(y) }[/math], than we are about the point estimators of some parameters.

Parametric statistical procedures involve inference about the parameters of a model. In this chapter, although we use parametric models, we can view the methods as nonparametric, in the sense that the role of the parameters is to serve as tuning constants so as to have a density function that corresponds to the density of the observed data. The phrase “nonparametric density estimation”, however, is generally reserved for methods such as we discuss in Chapter 15. In Section 15.6, however, we consider use of parametric models in more obviously nonparametric methods. While we can identify certain procedures that are “parametric”, the classification of other statistical procedures is less clear. “Semiparametric” is sometimes used, but it is generally not a very useful term for describing a statistical procedure.

There are some relatively simple standard distributions that have proven useful for their ability to model the distribution of observed data from many different areas of application. The normal distribution is a good model for symmetric, continuous data from various sources. For skewed data, the lognormal and gamma distributions often work very well. Discrete data are often modeled by the Poisson or binomial distributions. Distributions such as these are families of distributions that have various numbers of parameters to specify the distribution completely. To emphasize that the density is dependent on parameters, we may write the density as [math]\displaystyle{ p(y \mid θ) }[/math], where θ may be a vector. Several of the standard parametric families are shown in Tables B.1 and B.2 beginning on page 660.

A standard way of estimating a density is to identify appropriate characteristics, such as symmetry, modes, range, and so on, choose some well-known parametric distribution that has those characteristics, and then estimate the parameters of that distribution. For example, if the density is known or assumed to be zero below some point, to be unimodal, and to extend without limit along the positive axis, a three-parameter gamma distribution with density
[math]\displaystyle{ p(y \mid α, β, γ) = \frac{1}{Γ(α)β^{α}}\,(y−γ)^{α−1} e^{−(y−γ)/β}, \quad \text{for } γ ≤ y, }[/math]
may be used to model the data. The three parameters α, β, and γ are then estimated from the data.

If the probability density of interest has a finite range, a beta distribution may be used to model it, and if it has an infinite range at both ends, a normal distribution, a Student’s t distribution, or a stable distribution may be a useful approximation.

14.1 Fitting a Parametric Probability Distribution

Fitting a parametric density to a set of data is done by estimating the parameters. The estimate of the density, [math]\displaystyle{ \hat{p}(y) }[/math] , is formed by substitution of the estimate of the parameters: [math]\displaystyle{ \hat{p}(y) = p(y \mid \hat{θ}). (14.1) }[/math]

There are several ways of estimating the parameters, and for more complicated models there are many issues that go beyond just estimating the parameters. Many of the methods of fitting the model involve minimization of residuals. To fit a parametric probability density, the most common ways of estimating the parameters are maximum likelihood, matching moments, and matching quantiles.

Maximum Likelihood Methods

The method of maximum likelihood involves the use of a likelihood function that comes from the joint density for a random sample. If p(y \mid θ) is the underlying density, the joint density is just [math]\displaystyle{ \prod_i p(y_i \mid θ) }[/math]. The likelihood is a function of the parameter θ:
[math]\displaystyle{ L(θ; y_1, \ldots, y_n) = \prod_i p(y_i \mid θ). }[/math]
Note the reversal in roles of variables and parameters.

The mode of the likelihood (that is, the value of θ for which L attains its maximum value) is the maximum likelihood estimate of θ for the given data, y. The data, which are realizations of the variables in the density function, are considered as fixed and the parameters are considered as variables of the optimization problem in maximum likelihood methods.
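
To make the mechanics concrete, the following is a minimal sketch (not from the book) of maximum likelihood fitting of a two-parameter gamma density by numerical optimization, treating the data as fixed and the parameters as the optimization variables; the simulated sample, the log-parameterization, and the use of scipy.optimize.minimize are illustrative assumptions rather than the book's own code.

```python
# A minimal sketch (not from the book) of maximum likelihood fitting of a
# two-parameter gamma density p(y | alpha, beta) by numerical optimization.
import numpy as np
from scipy import optimize, special

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=3.0, size=500)   # simulated sample (assumption)

def neg_log_likelihood(theta, y):
    alpha, beta = np.exp(theta)                 # log-parameterization keeps alpha, beta > 0
    # log p(y | alpha, beta) = (alpha-1) log y - y/beta - log Gamma(alpha) - alpha log beta
    return -np.sum((alpha - 1) * np.log(y) - y / beta
                   - special.gammaln(alpha) - alpha * np.log(beta))

res = optimize.minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), args=(y,))
alpha_hat, beta_hat = np.exp(res.x)             # maximum likelihood estimates
print(alpha_hat, beta_hat)
```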

Chapter 15. Nonparametric estimation of probability density functions.

Estimation of a probability density function is similar to the estimation of any function, and the properties of the function estimators that we have discussed are relevant for density function estimators. A density function p(y) is characterized by two properties: it is nonnegative, and it integrates to 1.

In this chapter, we consider several nonparametric estimators of a density; that is, estimators of a general nonnegative function that integrates to 1 and for which we make no assumptions about a functional form other than, perhaps, smoothness.

It seems reasonable that we require the density estimate to have the characteristic properties of a density:

  • [math]\displaystyle{ \hat{p}(y) ≥ 0 \text{ for all } y; }[/math]
  • [math]\displaystyle{ \int_{\mathbb{R}^d} \hat{p}(y)\, dy = 1. }[/math]

A probability density estimator that is nonnegative and integrates to 1 is called a bona fide estimator.

Rosenblatt (1956) showed that no unbiased bona fide estimator can exist for all continuous p. Rather than requiring an unbiased estimator that cannot be a bona fide estimator, we generally seek a bona fide estimator with small mean squared error or a sequence of bona fide estimators [math]\displaystyle{ \hat{p}_n }[/math] that are asymptotically unbiased; that is, [math]\displaystyle{ \mathrm{E}_p(\hat{p}_n(y)) → p(y) }[/math] for all [math]\displaystyle{ y ∈ \mathbb{R}^d }[/math] as n → ∞.

15.1 The Likelihood Function

Suppose that we have a random sample, y_1,...,y_n, from a population with density p. Treating the density p as a variable, we write the likelihood functional as …

15.3 Kernel Estimators

Kernel methods are probably the most widely used technique for building nonparametric probability density estimators. They are best understood by developing them as a special type of histogram. The difference is that the bins in kernel estimators are centered at the points at which the estimator is to be computed. The problem of the choice of location of the bins in histogram estimators does not arise.

Rosenblatt’s Histogram Estimator; Kernels

For the one-dimensional case, Rosenblatt (1956) defined a histogram that is shifted to be centered on the point at which the density is to be estimated. Given the sample y_1,...,y_n, Rosenblatt’s histogram estimator at the point y is
[math]\displaystyle{ \hat{p}_R(y) = \frac{\#\{y_i \text{ s.t. } y_i ∈ (y−h/2,\, y+h/2]\}}{nh}. \qquad (15.25) }[/math]
This histogram estimator avoids the ordinary histogram’s constant-slope contribution to the bias. This estimator is a step function with variable lengths of the intervals that have constant value. Rosenblatt’s centered histogram can also be written in terms of the ECDF:
[math]\displaystyle{ \hat{p}_R(y) = \frac{P_n(y+h/2) − P_n(y−h/2)}{h}, \qquad (15.26) }[/math]
where, as usual, P_n denotes the ECDF. As seen in this expression, Rosenblatt’s estimator is a centered finite-difference approximation to the derivative of the empirical cumulative distribution function (which, of course, is not differentiable at the data points). We could, of course, use the same idea and form other density estimators using other finite-difference approximations to the derivative of P_n. Another way to write Rosenblatt’s shifted histogram estimator over bins of length h is
[math]\displaystyle{ \hat{p}_R(y) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{y − y_i}{h}\right), \qquad (15.27) }[/math]
where K(t) = K_u(t) is the uniform or “boxcar” kernel of equation (4.65) with smoothing parameter λ = 1/2. Notice, however, that the role of λ in our earlier formulation of kernel functions is effectively taken over by h; hence, we will consider it to be the smoothing parameter in this kind of density estimator. Other values of the smoothing parameter or kernel functions could be used.
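
The following is a minimal sketch (not from the book) of the kernel estimator in equation (15.27) with the boxcar kernel; the function name kde_boxcar, the simulated normal sample, and the choice h = 0.5 are illustrative assumptions.

```python
# A minimal sketch (not from the book) of Rosenblatt's shifted-histogram,
# i.e., boxcar-kernel, density estimator (15.27) with smoothing parameter h.
import numpy as np

def kde_boxcar(y_grid, y_sample, h):
    """Evaluate the boxcar-kernel estimator at each point of y_grid."""
    y_sample = np.asarray(y_sample, dtype=float)
    n = y_sample.size
    t = (np.asarray(y_grid, dtype=float)[:, None] - y_sample[None, :]) / h
    K = (np.abs(t) <= 0.5).astype(float)      # uniform ("boxcar") kernel, lambda = 1/2
    return K.sum(axis=1) / (n * h)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)                 # simulated data (assumption)
grid = np.linspace(-3.0, 3.0, 61)
estimate = kde_boxcar(grid, sample, h=0.5)    # density estimate at each grid point
```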

Chapter 16. Statistical learning and data mining.

A major objective in data analysis is to identify interesting features or structure in the data. In this chapter, we consider the use of some of the tools and measures discussed in Chapters 9 and 10 to identify interesting structure. The graphical methods discussed in Chapter 8 are also very useful in discovering structure, but we do not consider those methods further in the present chapter.

There are basically two ways of thinking about “structure”. One has to do with counts of observations. In this approach, patterns in the density are the features of interest. We may be interested in whether the density is multimodal, whether it is skewed, whether there are holes in the density, and so on. The other approach seeks to identify relationships among the variables. The two approaches are related in the sense that if there are relationships among the variables, the density of the observations is higher in regions in which the relationships hold. Relationships among variables are generally not exact, and the relationships are identified by the higher density of observations that exhibit the approximate relationships.

An important kind of pattern in data is a relationship to time. Often, even though data are collected at different times, the time itself is not represented by a variable on the dataset. A simple example is one in which the data are collected sequentially at roughly equal intervals. In this case, the index of the observations may serve as a surrogate variable. Consider the small univariate dataset in Table 16.1, for example.

A static view of a histogram of these univariate data, as in Figure 16.1, shows a univariate bimodal dataset. Figure 16.2, however, in which the data are plotted against the index (by rows in Table 16.1), shows a completely different structure. The data appear to be sinusoidal with an increasing frequency. The sinusoidal responses at roughly equal sampling intervals result in a bimodal static distribution, which is the structure seen in the histogram.

Interesting structure may also be groups or clusters of data based on some measure of similarity, as discussed in Section 9.2 beginning on page 383. When there are separate groups in the data, but the observations do not contain an element or an index variable representing group membership, identifying nearby elements or clusters in the data requires some measure of similarity (or, equivalently, of dissimilarity).

Figure 16.3 shows four different bivariate datasets, each of which consists of two clusters. The criteria that distinguish the clusters are different in the datasets. In Figure 16.3(a), the clusters are defined by proximity; the points in each cluster are closer to the respective cluster centroid than they are to the centroid of the other cluster.

In Figures 16.3(b), 16.3(c), and 16.3(d), the definitions of the clusters are somewhat more difficult. The clusters are defined by characteristics of the clusters themselves (that is, by structures that the clusters individually exhibit). These clusters are sometimes called “conceptual clusters”; the points are members of a cluster because of some concept or holistic characteristic of the set of points, such as lying close to a straight line.

The plots in Figure 16.3 also illustrate one of the problems in the identifi- cation of clusters: In some cases, although the clusters are obvious, there are a few individual observations that could apparently belong to either cluster.

Figure 16.4 shows two different two-dimensional datasets (that is, datasets containing two variables) whose members fall almost within one-dimensional manifolds.

Fig. 16.4. Relational Structures in Bivariate Datasets (left panel: Linear Structure; right panel: Nonlinear Structure)

We may occasionally wish to identify any of the types of groups or structures shown in Figures 16.2, 16.3, and 16.4, but we will concentrate in this chapter on identifying the types of clusters shown in the first graph in Figure 16.3 (that is, clusters whose centroids are different).

Although we often assume that the data space is a subspace of [math]\displaystyle{ \mathbb{R}^m }[/math], a data space may be more general. Data, for example, may be character strings such as names. The more general types of data may be mapped from the original data space to a “feature space”, which is a subspace of [math]\displaystyle{ \mathbb{R}^m }[/math]. The variables may be measured on different scales; they may, of course, represent completely different phenomena, so measurement scales cannot be made the same. One way of reconciling the measurements, however, is to standardize the data using the transformation (9.22) on page 392,
[math]\displaystyle{ X_S = (X − \overline{X})\, \mathrm{diag}(1/\sqrt{s_{ii}}), }[/math]
where [math]\displaystyle{ \overline{X} }[/math] is the matrix whose constant columns contain the means of the corresponding columns of X, and [math]\displaystyle{ \sqrt{s_{ii}} }[/math] is the sample standard deviation of the ith column of X.
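
As a small illustration (not from the book), the standardization transformation above amounts to subtracting each column mean and dividing by each column's sample standard deviation; the toy data matrix below is an assumption for illustration only.

```python
# A minimal sketch (not from the book) of the standardization transformation (9.22):
# subtract each column mean and divide by the column sample standard deviation.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [4.0, 300.0]])                 # toy data matrix (assumption)
X_bar = X.mean(axis=0)                       # column means
s = X.std(axis=0, ddof=1)                    # sample standard deviations sqrt(s_ii)
X_S = (X - X_bar) / s                        # standardized data matrix
```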

We may be interested in finding the nearest neighbors of a given observation based on their similarity; or, alternatively, we may be interested in identifying all observations within a given degree of closeness to a given observation. This problem is called a “proximity search”.

In the following sections, we consider general problems in multivariate data analysis. Our emphasis will be on exploratory analysis, and the main goals will be to identify clusters in data and to determine lower-dimensional structures in multidimensional data. We will use the methods and measurements that we discussed in Section 9.2 beginning on page 383.

Interesting structure may involve clusters of data, or it may be the result of the data lying on or near a space of reduced dimension. Interesting structure may also be defined generically as properties of the data that differ from expected properties if the data were a random sample from a multivariate normal distribution or from some other standard distribution. The normal (or Gaussian) distribution lies at the heart of many methods of data analysis. The heuristic definition of structure as a departure from normality can be motivated by the fact that most randomly selected low-dimensional projections of any high-dimensional dataset will appear similar to a random sample from a multivariate normal distribution (see Diaconis and Freedman, 1984).

The usual objective in cluster analysis is to divide the observations into groups that are close to each other or are more homogeneous than the full set of observations. An observation may consist of categorical variables that may (or may not) specify the class to which the observation belongs. In general, as we discuss on page 385, if the ith observation can be represented as
[math]\displaystyle{ x_i = (x_i^r, x_i^c), \qquad (16.1) }[/math]
where the subvector [math]\displaystyle{ x_i^c }[/math] represents values of the categorical variables, we may wish to handle the [math]\displaystyle{ x_i^c }[/math] component separately. In Figure 16.3, for example, suppose that each observation consists of values for three variables, x_1 and x_2 as shown and a third variable that represents group membership that corresponds to the symbol in the graphical depiction. In that case, the classes may already be defined, or we may want to allow the possibility that observations with different values of the categorical variable nevertheless belong to the same class. In most of the following, we will assume that none of the variables are categorical.

16.1 Clustering and Classification

Identifying groups of similar observations in a dataset is an important step in making sense of the data and in understanding the phenomena represented by the data. Clustering, classification, and discrimination are terms that describe this activity, which lies at the crossroads of a number of traditional disciplines, including statistics, computer science, artificial intelligence, and electrical engineering. Classification is sometimes called statistical learning or machine learning, especially in the more engineering-oriented disciplines. As is often the case when scientific methods are developed within diverse areas, there are several slight variations of theory and methodology, which are sometimes described as “statistical”, “inductive”, and so on. The slight variations lead to a variety of terms to describe the methods, and there is generally scant theory to support optimality of one method over another. The various approaches to clustering and classification also lead to the use of terms such as “hypothesis”, “bias”, and “variance” that have different meanings from their technical statistical definitions.

Clustering and classification make use of a wide range of statistical techniques, both descriptive methods utilizing simple summary statistics and graphics and methods of fitting equations to data. Statistical techniques in clustering and classification often emphasize uncertainty and the importance of dealing with noise in the data. A good general reference on clustering and classification, generally from a statistical perspective, is Gordon (1999). Hastie, Tibshirani, and Friedman (2009) discuss classification using terminology from both the statistics and machine learning disciplines.

The first step in forming groups is to develop a definition of the groups. This may be based on similarities of the observations or on closeness of the observations to one another.

Clustering

Cluster analysis is generally exploratory. It seeks to determine what groups are present in the data. If the groups are known from some training set, “discriminant analysis” seeks to understand what makes the groups different and then to provide a method of classifying observations into the appropriate groups. When discriminant analysis is used to “train” a clustering method, we refer to the procedure as “supervised” classification. Discriminant analysis is mechanically simpler than cluster analysis. Clustering is “unsupervised” classification. We will discuss classification in Chapter 17.

Because of the large number of possibilities for grouping a set of data into clusters, we generally must make some decisions to simplify the problem. One way is to decide a priori on the number of clusters; this is done in K-means clustering, discussed below. Another way is to do recursive clustering; that is, once trial clusters are formed, observations are not exchanged from one cluster to another. Two pairs of observations that are in different clusters at one stage of the clustering process would never be split so that at a later stage one member of each pair is in one cluster and the other member of each pair is in a different cluster.

There are two fundamentally different approaches to recursive clustering. One way is to start with the full dataset as a single group and, based on some reasonable criterion, partition the dataset into two groups. This is called divisive clustering. The criterion may be the value of some single variable; for example, any observation with a value of the third variable larger than 5 may be placed into one group and the other observations placed in the other group.
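
As a concrete illustration of deciding the number of clusters a priori, the following is a minimal sketch (not from the book) of the K-means algorithm mentioned above; the function name kmeans, the random initialization, and the convergence test are illustrative assumptions.

```python
# A minimal sketch (not from the book) of K-means clustering, in which the
# number of clusters k is fixed a priori.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # initial centroids
    for _ in range(n_iter):
        # assign each observation to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned observations
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                         # stop when centroids settle
            break
        centroids = new
    return labels, centroids
```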

Chapter 17: Statistical Models of Dependencies

In the models and data-generating processes we have considered in previous chapters of Part IV, all of the variables or features were treated essentially in the same way. In this chapter, we consider models in which a subset of variables, often just one, is of special interest. This variable is the “response”, and we seek to understand its dependence on the other variables. “Dependence” here refers to a stochastic relationship, not a causal one. By knowing the relationship of the other variables to the response variable, we may better understand the data-generating process, or we may be able to predict the response, given the values of the associated variables.

The models we consider in this chapter describe the stochastic behavior of one variable, Y , possibly a vector, as a function of other variables. Models of this type that express dependencies are called regression models if Y is a numeric variable or classification models if Y is a categorical variable. If Y is a numeric variable that takes on only a countable number of values, the model can be considered either a regression model (sometimes a “generalized model”) or a classification model.

Another important type of dependency arises in sequentially sampled variables. The distribution of a random variable at time t k depends on the realization of that random variable at times before t k. There may also be covariates whose realizations affect the distribution of the variable of interest. A random process that possibly changes in time is called a stochastic process. Because change itself is of interest in such processes, the model is often expressed as a differential equation.

The development and use of a model is an iterative process that includes data collection and analysis. It requires looking at the data from various perspectives. The model embodies both knowledge and assumptions. The knowledge may result from first principles or from previous observations. A model can be strong (very specific) or weak (leaving open many possibilities).

If the range of possibilities in a model can be limited to a set of real numbers, the possibilities are represented by a parameter. Parametric statistical procedures involve inference about the parameters of a model. Nonparametric methods in statistics also rely on models that usually contain parameters; the phrase “nonparametric” often refers to a method that does not specify a family of probability distributions except in a very general way.

Models

While some methods are referred to as “model-free”, and the phrase “model-based approach” is sometimes used to describe a statistical method, implying that other, “non-model-based” approaches exist, in reality some model underlies all statistical analyses. The model is not immutable, however, and much of the effort of an analysis may go into developing and refining a model. In exploratory data analysis, or EDA, the model is quite weak. The patterns and other characteristics identified in an exploratory data analysis help to form a stronger model. In general, whenever the model is weak, a primary objective is usually to build a stronger model.

There are various types of models. They may have different purposes. A common form of a model is a mathematical equation or a system of equations. If the purpose of the model is to enhance the understanding of some phenomenon, there would be a large premium on simplicity of the model. If the model is very complicated, it may correspond very well to the reality being studied, but it is unlikely to be understandable. If its primary purpose is to aid understanding, an equation model should be relatively simple. It should not require an extensive period of time for scrutiny.

A model may be embedded in a computer program. In this case, the model itself is not ordinarily scrutinized; only its input and output are studied. The complexity of the model is not of essential consequence. Especially if the objective is prediction of a response given values of the associated variables, and if there is a large premium on making accurate predictions or classifications in a very short time, an algorithmic model may be appropriate. An algorithmic model prioritizes prediction accuracy. The details of the model may be very different from the details of the data-generating process being modeled. That is not relevant; the important thing is how well the output of the algorithmic model compares to the output of the data-generating process being modeled when they are given the same input.

Model Inference Using Data

Data analysis usually proceeds through some fairly standard steps. Before much is known about the process being investigated, the statistician may just explore the data to arrive at a general understanding of its structure. This may involve many graphical displays in various coordinate systems and under various projections. When more than one variable is present, relationships among the variables may be explored and models describing these relationships developed.

In Section 17.4, we discuss the use of observational data to fit a particular kind of model, namely a classification model. Finally, in Section 17.5, we discuss the use of transformations of data so that a model of a given form will fit it better. One-to-one transformations do not result in any information loss, and they often make observational data easier to understand.

17.1 Regression and Classification Models

In many applications, some subset of variables may be characterized as “dependent” on some other subset of variables; in fact, often there is just a single “dependent” variable, and our objective is to characterize its value in terms of the values of the other variables. (The word “dependent” is in quotes here because we do not wish necessarily to allow any connotation of causation or other stronger meanings of the word “dependence”. In the following, we use “dependent” in this casual way but do not continue to emphasize that fact with quotation marks.) The dependent variable is often called the “response”, and the other variables are often called “factors”, “regressors”, “independent variables”, “carriers”, “stimuli”, or “covariates”. (The word “independent” has some connotative difficulties because of its use in referring to a stochastic property, and some authors object to the use of “independent” here. Most choices of words have one kind of problem or another. A problem with “stimulus”, for example, is the implication of causation. I am likely to use any one of these terms at various times. Fortunately, it does not matter; the meaning is always clear from the context.)

The asymmetric relationship between a random variable Y and a variable x may be represented as a black box that accepts x as input and outputs Y :

[math]\displaystyle{ Y \leftarrow \text{unknown process} \leftarrow x. \qquad (17.1) }[/math]

The relationship might also be described by a statement of the form

[math]\displaystyle{ Y←f(x) }[/math]

or

[math]\displaystyle{ Y≈f(x). (17.2) }[/math]

If f has an inverse, the model (17.2) appears symmetric. Even in that case, however, there is an asymmetry that results from the role of random variables in the model; we model the response as a random variable. We may think of f(x) as a systematic effect and write the model with an additive adjustment, or error, as

[math]\displaystyle{ Y = f(x) + E (17.3) }[/math]

or with a multiplicative error as

[math]\displaystyle{ Y = f(x)Δ, (17.4) }[/math]

where E and Δ are assumed to be random variables. (The “E” is the Greek uppercase epsilon.) We refer to these as “errors”, although this word does not indicate a mistake. Thus, the model is composed of a systematic component related to the values of x and a random component that accounts for the indeterminacy between Y and f(x). The relative contribution to the variability in the observed Y due to the systematic component and due to the random component is called the signal to noise ratio. (Notice that this is a nontechnical term here; we could quantify it more precisely in certain classes of models.)

In the case of the black-box model (17.1), both the systematic and random components are embedded in the box.

Models with multiplicative random effects are not as widely used. In the following, we will concentrate on models with additive random effects. In such models, E is also called the “residual”.

Because the functional form f of the relationship between Y and x may contain a parameter, we may write the equation in the model as

[math]\displaystyle{ Y = f(x; θ) + E, \qquad (17.5) }[/math]
where θ is a parameter whose value determines a specific relationship within the family specified by f. In most cases, θ is a vector. In the usual linear regression model, for example, the parameter is a vector with two more elements than the number of elements in x,
[math]\displaystyle{ Y = β_0 + x^{\mathrm{T}} β + E, \qquad (17.6) }[/math]
where [math]\displaystyle{ θ = (β_0, β, σ^2) }[/math].

A generalization of the linear model (17.6) is the additive model,

[math]\displaystyle{ Y = β_0 + f_1(x_1, β_1) + \cdots + f_m(x_m, β_m) + E. \qquad (17.7) }[/math]

The specification of the distribution of the random component is a part of the model, and that part of the model can range from very general assumptions about the existence of certain moments or about the general shape of the density to very specific assumptions about the distribution. If the random component is additive, the mean, or more generally (because the moments may not exist) the appropriate location parameter, is assumed without loss of generality to be 0.

The model for the relationship between Y and x includes the equation (17.5) together with various other statements about Y and x such as the nature of the values that they may assume, statements about θ, and statements about the distribution of E. Thus, the model is
[math]\displaystyle{ \begin{cases} Y = f(x; θ) + E \\ \text{additional statements about } Y, x, θ, E. \end{cases} \qquad (17.8) }[/math]
In the following, for convenience, we will often refer to just the equation as the “model”.

Another way of viewing the systematic component of equation (17.5) is as a conditional expectation of a random variable,

[math]\displaystyle{ E(Y \mid x;θ) = f(x;θ). (17.9) }[/math]

This formulation is very similar to that of equations (17.3) and (17.4). The main differences are that in the formulation of equation (17.9) we do not distinguish between an additive error and a multiplicative one, and we consider the error to be a random variable with finite first moment. In equations (17.3) and (17.4), or equation (17.5), we do not necessarily make these assumptions about E or Δ.

If we assume a probability distribution for the random component, we may write the model in terms of a probability density for the response in terms of the systematic component of the model,
[math]\displaystyle{ p(y \mid x, θ). \qquad (17.10) }[/math]

Cast in this way, the problem of statistical modeling is the same as fitting a probability distribution whose density depends on the values of a covariate.

Generalized Models

If the response can take on values only in a countable set, a model of the form :[math]\displaystyle{ Y = f(x;θ) + E }[/math] may not be appropriate, especially if the covariate x is continuous.

Suppose, for example, that the response is binary (0 or 1) representing whether or not a person has had a heart attack, and x contains various biometric measures such as blood pressure, cholesterol level, and so on. The expected value E(Y \mid x;θ) is the probability that a person with x has had a heart attack. Even if a model such as

[math]\displaystyle{ E(Y \mid x;θ) = f(x;θ), (17.11) }[/math]

with continuous regressor x, made sense, it would not be clear how to fit the model to data or to make inferences about the model. A simple transformation of the response variable, τ(Y), does not improve the data-analysis problem; if Y is binary, so is τ(Y).

A problem with the model in this form is that the value of f(x; θ) must range between 0 and 1 for all (reasonable) values of x and θ. A function f could of course be constructed to have this range, but another way is to model a transformation of E(Y \mid x; θ). This can often be done in a way that has a meaningful interpretation. We first let
[math]\displaystyle{ π(x; θ) = \mathrm{E}(Y \mid x; θ), \qquad (17.12) }[/math]
which we generally write simply as π. If Y is binary with values 0 and 1, then π is just the probability that Y = 1.

Next, we introduce some function g(π), that, it is hoped, relates to the data-generating process being studied. The function g is called the link function. The appropriate form of the link function depends on the nature of the probability distribution. Notice that the link function is not just a transformation of the observable variable Y , as isτ(Y ) above. The link function is a transformation of the expected value of Y .

We now can form a “generalized model” that is more similar to the forms of the models for continuous responses than a form that models Y directly; that is,
[math]\displaystyle{ g(π) ≈ h(x; θ). \qquad (17.13) }[/math]

In the case of Y being binary, letting π = E(Y \mid x; θ), we may introduce the transformation
[math]\displaystyle{ g(π) = \log\!\left(\frac{π}{1−π}\right) = \mathrm{logit}(π). \qquad (17.14) }[/math]

The logit function in equation (17.14) is the log of the odds, π/(1−π). The logit function is useful for a Bernoulli distribution (that is, a binary response).

In this kind of problem, we often form a generalized model such as
[math]\displaystyle{ g(π) = β_0 + β_1 x_1 + \cdots + β_m x_m. \qquad (17.15) }[/math]

The generalized model formed in this way is called a logistic regression model. For different distributions, other link functions may be more appropriate or useful. If Y is not binary, but takes on a countable number of values, the link function may be the log of the expected value of Y .

In many useful models, h(x;θ) in equation (17.13) is linear inθ. The resulting model, such as equation (17.15), is called a generalized linear model. The analysis of generalized models is usually based on a maximum likelihood approach, and the elements of the analysis are often identified and organized in a manner that parallels the ANOVA of general linear models.
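
To illustrate the maximum likelihood fitting of the logistic regression model (17.15), the following is a minimal sketch (not from the book); the simulated covariates, the coefficient values used only to generate data, and the use of scipy.optimize.minimize are illustrative assumptions.

```python
# A minimal sketch (not from the book) of fitting the logistic regression model
# (17.15) by maximum likelihood, i.e., minimizing the negative log-likelihood.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                       # two covariates (assumption)
beta_true = np.array([-0.5, 1.0, 2.0])              # (beta_0, beta_1, beta_2), simulation only
eta = beta_true[0] + X @ beta_true[1:]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))     # binary responses

def neg_log_likelihood(beta, X, y):
    eta = beta[0] + X @ beta[1:]
    # log L = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

fit = optimize.minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y))
print(fit.x)                                        # estimated (beta_0, beta_1, beta_2)
```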

Classification Models

In the generic model for the classification problem, the variables are the pairs (Y,x), where Y is a categorical variable representing the subclass of the population to which the observation belongs. The generalized models discussed above can be viewed as classification models. A generalized model yields the probability that an observation has a particular response, or that the observation is in a given category. A logit

Overfitting

In any application in which we fit an overdetermined system,

[math]\displaystyle{ y≈Xb, }[/math]

it is likely that the given values of X and y are only a sample (not necessarily a random sample) from some universe of interest. There is perhaps some error in the measurements. It is also possible that there is some other variable not included in the columns of X. In addition, there may be some underlying randomness that cannot be accounted for.

The fact that y ≠ Xb for all b results because the relationship is not exact. Whatever value of b provides the best fit (in terms of the criterion chosen) may not provide the best fit if some other equally valid values of X and y were used. The given dataset is fit optimally, but the underlying phenomenon of interest may not be modeled very well. The given dataset may suggest relationships among the variables that are not present in the larger universe of interest. Some element of the “true” b may be zero, but in the best fit for a given dataset, the value of that element may be significantly different from zero. Deciding on the evidence provided by a given dataset that there is a relationship among certain variables when indeed there is no relationship in the broader universe is an example of overfitting.

There are various approaches we may take to avoid overfitting, but there is no panacea. The problem is inherent in the process. One approach to overfitting is regularization. In this technique, we restrain the values of b in some way. Minimizing ‖y − Xb‖ may yield a b with large elements, or values that are likely to vary widely from one dataset to another. One way of “regularizing” the solution is to minimize also some norm of b. We will discuss this approach on page 607.
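
As one concrete instance of regularization (not the book's own development), ridge regression minimizes ‖y − Xb‖² plus a multiple of ‖b‖²; the sketch below, including the function name ridge and the closed-form solve, is an illustrative assumption.

```python
# A minimal sketch (not from the book) of ridge regression, one regularization
# approach: minimize ||y - Xb||^2 + lam * ||b||^2.
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    # closed-form solution of the penalized least-squares problem
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```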

17.2 Probability Distributions in Models

Statistical inference (that is, estimation, testing, or prediction) is predicated on probability distributions. This means that the model (17.8) or (17.21) must include some specification of the distribution of the random component or the residual term, E. In statistical modeling, we often assume that the independent variables are fixed — not necessarily that there is no probability distribution associated with them, but that the set of observations we have are considered as given — and the random mechanism generating them is not of particular interest. This is another aspect of the asymmetry of the model. This distinction between what is random and what is not is a basic distinction between regression analysis and correlation analysis, although we do not wish to imply that this is a hard and fast distinction.

The probability distribution for the residual term determines a family of probability distributions for the response variable. This specification may be very complete, such as that E ∼ N(0, σ²I), or it may be much less specific. If there is no a priori specification of the distribution of the residual term, the main point of the statistical inference may be to provide that specification. If the distribution of the residual has a first moment, it is taken to be 0; hence, the model (for an individual Y) can be written as E(Y) = f(x; θ); that is, the expected value of Y is the systematic effect in the model. If the first moment does not exist, the median is usually taken to be 0 so that the median of y is f(x; θ). More generally, we can think of the model as expressing a conditional probability density for Y:
[math]\displaystyle{ p_Y(y) = p(y \mid f(x; θ)). \qquad (17.26) }[/math]

Hierarchical Models

The basic general model (17.8) can be a component of a hierarchical model:

[math]\displaystyle{ y = f(x; θ) + ε, \quad x = g(w; τ) + δ, \qquad (17.27) }[/math]

or

[math]\displaystyle{ y = f(x; θ) + ε, \quad θ ∼ D(τ), \qquad (17.28) }[/math]

where D(τ) is some distribution that may depend on a parameter, τ. Either of these models could be part of a larger hierarchy, of course.

Hierarchical models of the form (17.27) arise in various applications, such as population dynamics, where the components are often called “compartments”, or in situations where the independent variables are assumed not to be observable or not to be measured without error.

Hierarchical models of the form (17.28) are useful in Bayesian statistics. In that case, we may identify the components of the model as various joint, marginal, and conditional distributions. For example, following the general outline presented on page 43, we may consider the joint distribution of the random variables Y and Θ (using an uppercase letter to emphasize that it is a random variable), [math]\displaystyle{ (Y, Θ) ∼ D_{Y,Θ}(x, τ), }[/math] the conditional distribution of Y given Θ, [math]\displaystyle{ Y ∼ D_{Y \mid θ}(f(x; θ)), }[/math] or the conditional distribution of Θ given Y, …

17.3 Fitting Models to Data

Observational data help us to build a model. The model helps us to understand nature. Standard ways of developing our knowledge of nature involve estimation and tests of hypotheses — that is, statistical inference. Inference about the model y = f(X; θ) + ε involves estimation of the parameter θ, tests of hypotheses about θ, and inference about the probability distribution of ε. It may also involve further consideration of the model, the form of f, or other issues relating to the population, such as whether the population or the sample is homogeneous, whether certain observations are outliers, and so on.

The statistical characteristics of the estimator, such as its bias and variance, depend on how the model is fit (that is, on how the estimates are computed) and on the distribution of the random component. For a specific family of distributions of the random component and a specific form of f, it may be possible to determine estimators that are optimal with respect to some statistical characteristic such as mean squared error. A unified approach to model inference involves a method of estimation that allows for statements of confidence and that provides the basis for the subsequent inference regarding the distribution of ε and the suitability of the model. In this section, we will be concerned primarily with methods of fitting the model rather than with the problem of statistical inference.

The Mechanics of Fitting

Fitting a model using data can be viewed simply as a mechanical problem of determining values of the parameters so that functional relationships expressed by the model are satisfied approximately by the given set of data. Fitting a model to data is often a step in statistical estimation, but estimation generally involves a deeper belief about the nature of the data. The data are realizations of a random variable whose distribution is related to or specified by the model. Statistical estimation also includes some assessment of the distribution of the estimator.

One of the first considerations is what approach to take in modeling. Is the objective to develop a model in the form of equations and statements about distributions of elements of the model, as in equation (17.8), or is it acceptable to have a black box of the form (17.1) together with an algorithm that accepts x and produces a good prediction of Y? Often, a set of rules is sufficient. Because there is no particular restriction on the complexity of the rules, as there would be if we attempted to express the rules in a single equation, the black box together with a prediction algorithm often performs better. A neural net, which can be quite complicated yet provides no insight into identifiable functions and parameters, often yields excellent predictions of the response for a given input x.

An additional consideration is whether the fit is to be global or local (that is, whether a single model describes the data over the full domain and all observations are used at once to fit the model, or whether different models apply in different domains and only “local” observations are used to fit the model within each domain).

On page 587, we listed five basic approaches for fitting models using data: method of moments, minimizing residuals, maximizing the likelihood, homogeneity of modeled classes, and predictive accuracy in partitioned data. Any of these approaches can be applied in fitting models that have a continuous-valued response. (We can assess class homogeneity if one or more of the covariates is a classification variable, or we may be able to discretize the response into meaningful groups.) If the model has a discrete-valued response, or if the purpose is classification, there are two possibilities. One is the use of a generalized model, which effectively makes the response continuous-valued and can yield a probability-based or fuzzy classification. Otherwise, the fitting problem can be addressed directly, and the class purity is the primary criterion. In the classification problem, the predictive accuracy in partitioned data is almost always considered in the model fitting. The dataset can be partitioned either randomly or systematically, as we discuss in Chapter 12. Whatever method is used to fit a model, it may be followed by some further steps to bring the model and the data into closer agreement. Individual observations that do not agree well with the fitted model may be treated specially in some way. Some outlying observations may be modified or even removed from the dataset, and then the model is refit using the cleaned data.

Estimation by Minimizing Residuals

Of the basic approaches for fitting models using data, listed on page 587, perhaps the most intuitive is fitting based on minimizing the residuals. This is the method most often used by data analysts and scientists to fit a curve to data without any assumptions about probability distributions.

For a given function f, the fit is accomplished by solving an optimization problem involving some function of the vector of residuals, r = y − f(X; θ), where y is the vector of observed responses and X is the matrix of corresponding observations of the covariates. The decision variable in the optimization problem is θ.

Notice that the r_i are vertical distances, as shown in Figure 17.1 for a simple linear model. Another way of measuring residuals in a model is indicated by the orthogonal residuals, d_i, shown in Figure 17.3.

For a given set of data {(y_i, x_i)}, the residuals r_i are functions of f and θ. Clearly, the space of functions from which to select f must be restricted in some way; otherwise, the problem is not well-defined. We generally restrict the function space to contain only relatively tractable functions, such as low-degree polynomials, often just linear functions, exponential functions, or functions that can be formed from common basis functions, as we discuss in Section 4.2 and in Chapter 10. Once a general form of f is chosen, the residuals are functions of θ, r_i(θ). …

Least Squares Estimation in Linear Models

The most familiar example of fitting a statistical model to data uses the linear regression model (17.22), [math]\displaystyle{ y = Xβ + ε. }[/math] The least-squares fit of this model is exactly the same as the least-squares fit of the overdetermined system (5.54), Xb ≈ y, on page 229 in Section 5.6. It is the solution to the normal equations,
[math]\displaystyle{ X^{\mathrm{T}} X \hat{β} = X^{\mathrm{T}} y, }[/math]
which, as we determined in that section, is [math]\displaystyle{ \hat{β} = X^{+} y, }[/math] which, in the case of full-rank X, is
[math]\displaystyle{ \hat{β} = (X^{\mathrm{T}} X)^{−1} X^{\mathrm{T}} y. \qquad (17.39) }[/math]
As we pointed out in Section 5.6, fitting a model is not the same as statistical inference. Statistical inference requires a probability distribution on ε as part of the model. If we ignore any distributions on X and β, or at least just state results conditionally on X and β, we have that if E(ε) = 0 then [math]\displaystyle{ \hat{β} }[/math] above is unbiased for β, that is, [math]\displaystyle{ \mathrm{E}(\hat{β}) = β, }[/math] and [math]\displaystyle{ \mathrm{V}(\hat{β}) = X^{+} \mathrm{V}(ε) (X^{+})^{\mathrm{T}}. }[/math]

With an additional assumption that V(ε) = σ²I, for a constant σ², we have [math]\displaystyle{ \mathrm{V}(\hat{β}) = X^{+} (X^{+})^{\mathrm{T}} σ^2, }[/math] but more importantly, we have the Gauss-Markov theorem that [math]\displaystyle{ \hat{β} }[/math] is the best (in the sense of minimum variance of all linear combinations) linear unbiased estimator of β.

With the assumption that ε ∼ N_n(0, σ²I), we have distributional properties of [math]\displaystyle{ \hat{β} }[/math] as well as of [math]\displaystyle{ (y − X\hat{β})^{\mathrm{T}}(y − X\hat{β}) }[/math] that allow us to develop most powerful statistical tests of hypotheses regarding β.

Another interesting property of the least-squares estimator is the relationship between the model with the centered data, as in equation (17.24), [math]\displaystyle{ y_c = X_c β_c + ε, }[/math] and the model that may include an intercept, equation (17.22), [math]\displaystyle{ y = Xβ + ε. }[/math] The least squares estimators are the same (except of course β_c does not contain a term corresponding to an intercept), and in any event, equation (17.25), [math]\displaystyle{ \bar{y} = \bar{X}\hat{β}, }[/math] is satisfied by the estimator without imposing this as a separate constraint. (Recall that fitting the equation by minimizing some other norm of the residuals does not ensure that the fit goes through the means.)
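
The following is a minimal sketch (not from the book) of computing the least-squares estimate (17.39) both through a pseudoinverse-based routine and through the normal equations; the simulated data and the chosen coefficient values are illustrative assumptions.

```python
# A minimal sketch (not from the book) of the least-squares fit (17.39).
# np.linalg.lstsq uses a pseudoinverse-based solution, so it also handles
# the non-full-rank case, corresponding to beta_hat = X^+ y.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # include an intercept column
beta = np.array([1.0, 2.0, -1.0, 0.5])                       # simulation-only coefficients
y = X @ beta + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)             # least-squares estimate
beta_hat_ne = np.linalg.solve(X.T @ X, X.T @ y)              # normal equations (full-rank X)
```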

Variations on Minimizing Residuals

When data are contaminated with some (unidentified) observations from a different data-generating process, it is desirable to reduce or eliminate the effect of these contaminants. Even if all of the observations arise from the same data-generating process, if that process is subject to extreme variance, it may be desirable to reduce the effect of observations that lie far from the mean of the model. It is not possible, of course, to know which observations …

Algorithm 17.1 Iterative Orthogonal Residual Fitting through the Means

Variable Selection

If we start with a model such as equation (17.5), Y = f(x;θ) + E, we are ignoring the most fundamental problem in data analysis: which variables are really related to Y , and how are they related?

We often begin with the premise that a linear relationship is at least a good approximation locally; that is, with restricted ranges of the variables. This leaves us with one of the most important tasks in linear regression analysis: selection of the variables to include in the model. There are many statistical issues that must be taken into consideration. …

Projection Pursuit Regression

In Chapter 16, we discussed several methods for analyzing multivariate data with the objective of identifying structure in the data or perhaps reducing the dimension of the data. Two projection methods discussed were principal components and projection pursuit. In models for dependencies, we often apply these multivariate methods just to the independent variables. In linear regression, principal components can be used to reduce the effects of multicollinearity among the independent variables. The dependent variable is regressed on a smaller number of linear combinations of the original set of independent variables.

Projection pursuit is a method of finding interesting projections of data. In the discussion of projection pursuit beginning on page 564, it was applied to a multivariate dataset with the objective of identifying structure in the data.

In regression modeling, projection pursuit is often applied to an additive model of the form
[math]\displaystyle{ Y = β_0 + \sum_{j=1}^{m} f_j(x_j, β_j) + E. }[/math]
The idea is to seek lower-dimensional projections of the independent variables, that is, to fit the model
[math]\displaystyle{ Y = β_0 + \sum_{j=1}^{m} f_j(α_j^{\mathrm{T}} x) + E. }[/math]

Projection pursuit regression involves the same kind of iterations described beginning on page 567 except that they are applied to the model residuals.

17.4 Classification

In Section 16.1, we considered the general problem of clustering data based on similarities within groups of data. That process is sometimes called “unsupervised classification”. In “supervised classification”, usually just called “classification”, we know to which groups all of our observed data, or at least a training set of data, belong. The objective is to develop a model for classification of new data.

Classification and Regression Trees

A rooted tree that is used to define a process based on a sequence of choices is called a decision tree. Decision trees that are used for classifying observations are called classification trees. They are similar to the cluster tree shown in Figure 16.5 on page 524 except that the nodes in the classification trees represent decisions based on values of the independent variables. Decision trees that are used for predicting values of a continuous response variable are called regression trees. The terminal nodes correspond to predicted values or intervals of predicted values of the response variable.

The objective in building a classification tree is to determine a sequence of simple decisions that would ultimately divide the set of observations into the given groups. Each decision is generally in the form of a comparison test based on the values of the variables that constitute an observation. Often, the comparison test involves only a single variable. If the variable takes on only a countable number of values, the comparison test based on the variable may be chosen to have as many possible outcomes as the values associated with the variable. If the variable has a continuous range of values, the comparison test is generally chosen to have a binary outcome corresponding to observations above or below some cutpoint. Trees with exactly two branches emanating from each nonterminal node are called binary trees. The terminal nodes of a classification tree represent groups or classes that are not to be subdivided further.

How well a test divides the observations is determined by the “impurity” of the resulting groups. There are several ways of measuring the impurity or how well a test divides the observations. Breiman et al. (1984) describe some methods, including a “twoing rule” for a binary test. For classifying the observations into k groups, at a node with n observations to classify, this rule assigns a value to a test based on the formula
[math]\displaystyle{ n_L n_R \left( \sum_{i=1}^{k} \left| \frac{L_i}{n_L} − \frac{R_i}{n_R} \right| \Big/ n \right)^{2}, \qquad (17.62) }[/math]
where n_L and n_R are the numbers of observations that the test assigns to the left and right child nodes, respectively, and L_i and R_i are the numbers of group i assigned to the left and right, respectively. The classification tree can be built by a greedy divide-and-conquer recursive partitioning scheme, as given in Algorithm 17.2.

Algorithm 17.2 Recursive Partitioning for Classification Using a Training Set

1. Evaluate all tests that divide the given set into mutually exclusive sets.
2. Choose the test that scores highest, and divide the set based on this test.
3. For any subset that contains observations from more than one group, repeat beginning at step 1.

The results of a binary partitioning may be presented as a tree, as shown in Figure 17.4, in which the rule for splitting at each node is shown. In this example, there are three groups and two numerical variables.
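
The following is a minimal sketch (not from the book) of the scoring step of Algorithm 17.2, using the twoing rule (17.62) to choose a single-variable binary split; the function names twoing_score and best_split and the exhaustive search over cutpoints are illustrative assumptions.

```python
# A minimal sketch (not from the book) of one step of recursive partitioning:
# score every single-variable binary split with the twoing rule (17.62) and
# choose the best one. Repeating this on each child builds a classification tree.
import numpy as np

def twoing_score(labels_left, labels_right, classes):
    n_L, n_R = len(labels_left), len(labels_right)
    n = n_L + n_R
    if n_L == 0 or n_R == 0:
        return 0.0
    # sum over groups of | L_i/n_L - R_i/n_R |, i.e., class-proportion differences
    s = sum(abs(np.mean(labels_left == c) - np.mean(labels_right == c))
            for c in classes)
    return n_L * n_R * (s / n) ** 2

def best_split(X, y):
    classes = np.unique(y)
    best = (None, None, -np.inf)                    # (variable index, cutpoint, score)
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = twoing_score(left, right, classes)
            if score > best[2]:
                best = (j, cut, score)
    return best
```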

Linear Classifiers

In the following we will assume that there are only two classes. While there are some apparently obvious ways of extending a binary classification method to more classes, the problem is not as simple as it may appear, and we will not pursue it here.

In a very simple instance of binary classification, an oblique linear split on the single root node of a classification tree yields two pure classes. This is illustrated in Figure 17.5, in which the points are classified based on whether they are above or below the line [math]\displaystyle{ w^{\mathrm{T}} x = w_0 }[/math].

This illustrates how the data in two dimensions, that is, the 2-vector x, could be reduced to one dimension. That one dimension would correspond to points along a single line normal to the line [math]\displaystyle{ w^{\mathrm{T}} x = w_0 }[/math]. The bivariate points projected onto that lower-dimensional space would be perfectly classified in the sense that a single criterion would separate the points. It is obvious from the figure that the separating line (or, in general, separating hyperplane) may not be unique.

Fig. 17.5. Linear Classification

Real world data rarely can be perfectly separated by a linear classifier as in this simple example. Whenever the data cannot be perfectly separated, we might consider some kind of optimal separating hyperplane. A simple criterion of optimality is a separation such that the sum of squares from the means of the two classes is maximal with respect to the sums of squares from the means within the two classes. This is the kind of decomposition done in analysis of variance.

In the classification problem, however, we generally do not assume that the variance-covariance matrix is necessarily proportional to the identity, as we do in the usual analysis of variance. The appropriate sums of squares must be scaled by the variance-covariance matrix. Since the variance-covariance matrix is likely unknown, we use the sample variance-covariance matrix, equation (9.10), applied to the separate groups.

A second problem arises in this approach, and that is the question of whether the variance-covariance matrix is the same in the two groups. There are various alternative procedures to try to accommodate different variance-covariance matrices, but we will not pursue them here. (The Behrens-Fisher problem in a two-group t-test is one of the most familiar of this type of problem.) …

Kernel Methods and Support Vector Machines

Identification of nonlinear structure in data is often preceded by a nonlinear mapping from the m-dimensional data space to a p-dimensional “feature space”. The feature space may have more, even many more, dimensions than the data space. The mapping carries the basic data m-vector x to a p-vector f(x). The feature space is a subspace of [math]\displaystyle{ \mathbb{R}^p }[/math] and thus is an inner-product space. (The original data space may be more general.) This is essentially the approach of support vector machines (see Mika et al. 2004). The basic idea is that while there may be no separating hyperplane in the data space as in Figure 17.5, there may be a separating hyperplane in the higher-dimensional feature space. The function f may be quite general. It may treat the elements of its vector argument differently, but a common form for the argument x = (x_1, ..., x_m) is
[math]\displaystyle{ \left( x_1^{e_p}, \ldots, x_m^{e_p},\ x_1^{e_{p−1}}, \ldots, x_m^{e_{p−1}}, \ldots, x_1^{e_1}, \ldots, x_m^{e_1} \right). }[/math]
Another common form is homogeneous in the elements of the data space vector; for example, for the 2-vector x = (x_1, x_2),
[math]\displaystyle{ \tilde{x} = f(x) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right). \qquad (17.67) }[/math]

Figure 17.6 illustrates a situation in which there is no separating hyperplane (line) in the two-dimensional data space. After transforming the data space to a feature space, however, a hyperplane can be chosen to separate the data. In the case shown in the right side of Figure 17.6, the data were transformed into a three-dimensional feature space, a separating plane was determined, and then the data were projected onto a z_1-z_2 plane that is perpendicular to the separating plane.

A basic operation in the use of support vector machines in classification is the computation of inner products in the feature space; that is, the computation of [math]\displaystyle{ \tilde{x}^{\mathrm{T}} \tilde{y} }[/math]. The computations to transform a data vector to a feature vector [math]\displaystyle{ \tilde{x} = f(x) }[/math], as well as the computation of the inner products in the higher-dimensional feature space, may be costly. The overall computations may be reduced by a simple expedient called the “kernel trick”, which involves finding a kernel function (see page 23) in the data space that is equivalent to the inner product in the feature space. The form of the kernel obviously depends on f. For the function shown in equation (17.67), for example, we have
[math]\displaystyle{ (f(x))^{\mathrm{T}} f(y) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right)^{\mathrm{T}} \left( y_1^2,\ \sqrt{2}\, y_1 y_2,\ y_2^2 \right) = \left( x^{\mathrm{T}} y \right)^{2} = K(x, y), \qquad (17.68) }[/math]
where we define the kernel K(x, y) as [math]\displaystyle{ \left( x^{\mathrm{T}} y \right)^{2} }[/math].
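
The following is a minimal sketch (not from the book) that numerically checks the kernel identity (17.68) for the feature map (17.67); the function name feature_map and the particular test vectors are illustrative assumptions.

```python
# A minimal sketch (not from the book) verifying the kernel identity (17.68):
# the inner product in the feature space equals (x^T y)^2 in the data space.
import numpy as np

def feature_map(x):
    # f(x) = (x1^2, sqrt(2) x1 x2, x2^2), as in equation (17.67)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = feature_map(x) @ feature_map(y)   # inner product computed in the feature space
rhs = (x @ y) ** 2                      # kernel evaluated directly in the data space
assert np.isclose(lhs, rhs)
```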

Although support vector machines can be used for exploratory clustering, most of the applications are for classification with a given training set.

Fig. 17.6. Data Space Transformed to Feature Space and Linear Classification in the Feature Space

Combining Classifications

Classification methods tend to be unstable; that is, small perturbations in the training datasets may yield large differences in the classification rules. To overcome the instability, which manifests itself in large variability of the classes, various ensemble methods have been suggested. These methods make multiple passes over the training dataset and then average the results. In some cases, the multiple passes may use the full training dataset with different classification methods, different subsets of the classification variables, or with perturbation of the results. In other cases the multiple passes use subsets or resamples of the full training dataset.

The averaging is essentially a voting by the various classifications formed following resampling. The idea of averaging is an old one; in this context it is sometimes called “stacked generalization”. A similar idea in linear regression in which subsamples of minimal size are used is called “elemental regression”. (The minimal size in linear regression is a dataset that has the same number of observations as variables.) The coefficient estimates are averaged over the subsamples.

In some cases, the full training dataset is used on each pass, but different classification methods are used, or else, following each pass, the results are perturbed. One of the most effective ways of doing this is to reweight the observations following each classification pass. The observations on which the most recent pass showed the weakest discrimination (that is, those that were misclassified or nearly so) are given heavier weight in the next pass. This method is called “boosting”. Similar methods are also called “arcing”, for “adaptively resample and combine”. In another multiple-pass method that uses the full dataset on each pass, instead of modifying the weights or the method of classification in each pass, the results are randomly perturbed; that is, the classes to which the observations are assigned are randomly changed. The randomly perturbed classes are then averaged. This method, surprisingly, also sometimes improves the classification.

Other ensemble methods use either a resample or a subsample of the full dataset. A method of forming random training sets by bootstrapping (that is, resampling) the given dataset is called “bagging”, because of the bootstrap samples. For a given “bag” or bootstrap sample, the classification tree is constructed by randomly choosing subsets of the features to use in forming the branches from nodes. The process is continued until the full tree is formed (that is, the tree is not “pruned”).

If random subsamples of the training dataset are used to form classification trees, the collection of random trees is called a random forest.

Within the broad range of ensemble classification methods there is a multitude of details that can be changed, resulting in an almost overwhelming number of classification methods. There does not appear to be a method that is consistently best. Some methods are better on some datasets than others, and there is no simple way of describing the optimality properties.
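
As one quick way to see such differences empirically, the following sketch (assuming scikit-learn; the synthetic dataset and settings are illustrative only) compares a single tree, bagged trees, and a random forest by cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data; in practice use a real training set
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

models = {
    "single tree":   DecisionTreeClassifier(random_state=1),
    "bagged trees":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)    # 10-fold cross-validation
    print(f"{name:14s} mean accuracy = {scores.mean():.3f}")
```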

An easy way to explore the various methods is by use of the Weka program, which is a collection of many algorithms for clustering and classification. Weka can either be applied directly to a dataset or called from the user’s Java code. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It can also be used for developing new machine learning schemes. It is open source software available at http://www.cs.waikato.ac.nz/ml/weka/

It is useful to compare classification methods using different datasets. An important collection of datasets is available at the Machine Learning Repository at the University of California at Irvine. The UCI Machine Learning Repository can be accessed at http://archive.ics.uci.edu/ml/

Assessing the Fit of a Model

We have described various approaches to fitting a model using data, such as maximum likelihood based on an assumed probability distribution, fitting by minimizing the residuals, and fitting by matching moments. Each of these methods has its place. In some cases, a particular type of one of these general schemes is identical to a particular type of another; for example, maximum likelihood under a normal distribution is equivalent to minimizing the sum of the squares of the residuals. Whatever method is used in fitting the model, we are interested in the goodness of the fit. If the method of fit is to minimize some norm of the residuals, intuitively it would seem that the proportional reduction in the norm of the residuals by fitting a model gives some indication of the goodness of that model. In other words, we compare the norm of the “residuals” with no model (that is, just the responses themselves) with the norm of the residuals after fitting the model,

$$\frac{\|r\|}{\|y\|},$$

which should be small if the fit is good, or, looking at it another way,

$$\frac{\|y\| - \|r\|}{\|y\|}, \qquad (17.72)$$

which should be close to 1 if the fit is good. Expression (17.72) is the familiar $R^2$ statistic from least squares linear regression.
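
A small sketch (NumPy only, on hypothetical data) of the two ratios just described, using an ordinary least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)    # hypothetical data

# Least squares fit of y on (1, x)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta                                     # residuals

print(np.linalg.norm(r) / np.linalg.norm(y))                        # small if the fit is good
print((np.linalg.norm(y) - np.linalg.norm(r)) / np.linalg.norm(y))  # expression (17.72)
```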

Although a particular method of fitting a model may yield a relatively small norm of the residual vector, certain observations may have very large residuals. In some cases of model fitting, deleting one or more observations may have a very large effect on the model fit. If one or more observations have very large residuals (the meaning of “very large” is not specified here), we may question whether the model is appropriate for these anomalous observations, even if the model is useful for the rest of the dataset. We may also question whether the method of fit is appropriate. The method of fit should come under special scrutiny if deleting one or more observations has a very large effect on the model fit.

Quantile plots are especially useful in assessing the validity of assumptions in a model. If the residuals are ranked from smallest to largest, the pattern should be similar to a pattern of ranked normal random variables. Given a sample of $n$ residuals, $r_{(1)} \le \cdots \le r_{(n)}$, we compare the observed values with the theoretical values corresponding to probabilities $p_{(1)} \le \cdots \le p_{(n)}$.
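
One way to construct such a plot is sketched below (assuming SciPy and Matplotlib; the residuals here are simulated stand-ins for residuals from an actual fit):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Simulated stand-ins for residuals from a fitted model
rng = np.random.default_rng(1)
r = rng.normal(scale=0.5, size=100)

n = len(r)
p = (np.arange(1, n + 1) - 0.5) / n      # probabilities p_(1) <= ... <= p_(n)
theoretical = stats.norm.ppf(p)          # normal quantiles for those probabilities
observed = np.sort(r)                    # ranked residuals r_(1) <= ... <= r_(n)

plt.scatter(theoretical, observed)
plt.xlabel("theoretical normal quantiles")
plt.ylabel("ordered residuals")
plt.title("normal Q-Q plot of residuals")
plt.show()
```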

Although the sample distribution of the residuals may give some indication of how well the data fit the model, because of the interactions of the residuals with the fitting process, the problem of assessing whether the model is appropriate is ill-posed. The most we can generally hope for is to assess the predictive ability of the model by use of partitioned data and cross validation, as discussed in Section 12.2.

Notes and Further Reading

Linear, General Linear, and Generalized Linear Models

Full rank linear statistical models of dependencies are called regression models, and linear classification models (not of full rank) are called ANOVA models or general linear models. Kutner, Nachtsheim, and Neter (2004) provide extensive coverage of these linear statistical models. Nonlinear regression models are covered in some detail by Seber and Wild (2003). The generalized linear model is a special type of nonlinear model that was introduced in its general setting by Nelder and Wedderburn (1972). Thorough coverage of generalized linear models is available in McCullagh and Nelder (1990). The special logistic generalized linear models are the subject of Hosmer and Lemeshow (2000).

An extensive discussion about the modeling issues and the statistical interpretation of the errors-in-variables model is provided by Fuller (1987).

Algorithmic Models

The article by Breiman (2001), with discussion, emphasizes the role of algorithmic models when the objective is prediction instead of a simple description with the primary aim of aiding understanding. A classic type of algorithmic model is a neural net. See Ripley (1993, 1994, 1996) for discussion of neural nets in modeling applications.

Transformations

Kutner, Nachtsheim, and Neter (2004) and Carroll and Ruppert (1988) have extensive discussions of various transformations for linear models. Velilla (1995) discusses multivariate Box-Cox transformations, as well as the robustness of the transformations to outliers and diagnostics for detecting the effect of outliers.

Recursive Partitioning

Many of the clustering methods discussed in Chapter 16 and the classification methods discussed in this chapter are based on recursively partitioning the dataset. Trees or other graphs are used to maintain the status during the recursion. General methods as well as specific applications of recursive partitions are summarized by Zhang (2004).

Conceptual Clustering

Hunt, Marin, and Stone (1966) called classification that is based on general concepts “concept learning”, and developed concept-learning systems. (Their purpose was to study ways that humans learn, rather than to address a given classification problem.) Michalski (1980) and Michalski and Stepp (1983) described “conceptual clustering”, which is a set of methods of classification that identifies sets of characteristics, or concepts.

Combining Classifications

Most of the best classifiers are based on ensemble methods in which classifiers are combined or improved by reweighting as in bagging and boosting. Bühlmann (2004) provides an overview and summary of ensemble methods for classification.

Some examples of applications in classification in which combined classifiers had markedly superior performance are described by Richeldi and Rossotto (1997) and by Westphal and Nakhaeizadeh (1997).

Regularized Fitting and Variable Selection

The use of ridge regression to deal with the problems of multicollinearity in regression analysis was described in a 1962 paper by A. E. Hoerl. Hoerl and Kennard (1970a,b) discussed various applications of ridge regression, described how it ameliorates the effects of multicollinearity, and gave several suggestions for the choice of the weighting factor λ. The regularization used in ridge regression is called Tikhonov regularization in the applied mathematical literature, after A. N. Tikhonov, who described it in a 1963 paper (in Russian).

The lasso was first described by Tibshirani (1996). The least angle regression (LAR) algorithm for variable selection, which happens to correspond to lasso with increasing weights on the $L_1$ penalty on the fitted coefficients, was described by Efron et al. (2004).
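
Ridge and lasso fits are available in standard software; here is a minimal sketch assuming scikit-learn (the dataset and the penalty value, which plays the role of the weighting factor λ, are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Low effective rank induces near-multicollinearity among the predictors
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       effective_rank=3, noise=5.0, random_state=4)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: may set some coefficients exactly to zero

print(ridge.coef_)
print(lasso.coef_)
```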

Many statistical issues relating to variable selection and regression model building are discussed in the book by Miller (2002).

In the linear regression model $y = X\beta$, instead of fitting the original set of $x$'s, we may use a set of the principal components $\tilde{x}$'s from equation (16.15). This is called principal components regression. While principal components regression does not remove any of the original variables from the model, the model itself, $y = \tilde{X}\tilde{\beta}$, is in a lower dimension. Jolliffe, Trendafilov, and Uddin (2003) describe a regularization, similar to that in the lasso, that shrinks some of the $\hat{w}_j$ in equation (16.14) to zero, so that the corresponding original variables are no longer in the model at all.
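
A minimal sketch of principal components regression (NumPy only; the choice of k retained components and the data are illustrative):

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of the centered X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal component directions (loadings)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                                   # keep the first k directions
    Z = Xc @ W                                     # component scores (the x-tilde's)
    beta, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
    return W, beta, x_mean, y_mean

def pcr_predict(model, Xnew):
    W, beta, x_mean, y_mean = model
    return (Xnew - x_mean) @ W @ beta + y_mean

# Illustrative use on simulated data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
model = pcr_fit(X, y, k=3)
print(pcr_predict(model, X)[:5])
```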

}}

Index

http://mason.gmu.edu/~jgentle/books/cmstatbk/Computational_Statistics_Index.pdf



Author: James E. Gentle; Date Value: 2009; title: Computational Statistics; titleUrl: http://books.google.com/books?id=m4r-KVxpLsAC; year: 2009