2006 PredictionLearningandGames

From GM-RKB

Subject Headings: Online Learning Algorithm; Online Sequence Labeling Algorithm

Notes

Cited By

Quotes

Book Overview

This important new text and reference for researchers and students in machine learning, game theory, statistics and information theory offers the first comprehensive treatment of the problem of predicting individual sequences. Unlike standard statistical approaches to forecasting, prediction of individual sequences does not impose any probabilistic assumption on the data-generating mechanism. Yet, prediction algorithms can be constructed that work well for all possible sequences, in the sense that their performance …

Table of Contents

1 Introduction
1.1 Prediction
1.2 Learning
1.3 Games
1.4 A Gentle Start
1.5 A Note to the Reader
2 Prediction with expert advice
2.1 Weighted Average Prediction
2.2 An Optimal Bound
2.3 Bounds That Hold Uniformly over Time
2.4 An Improvement for Small Losses
2.5 Forecasters Using the Gradient of the Loss
2.6 Scaled Losses and Signed Games
2.7 The Multilinear Forecaster
2.8 The Exponential Forecaster for Signed Games
2.9 Simulatable Experts
2.10 Minimax Regret
2.11 Discounted Regret
2.12 Bibliographic Remarks
2.13 Exercises
3 Tight Bounds for Specific Losses
3.1 Introduction
3.2 Follow the Best Expert
3.3 Exp-concave Loss Functions
3.4 The Greedy Forecaster
3.5 The Aggregating Forecaster
3.6 Mixability for Certain Losses
3.7 General Lower Bounds
3.8 Bibliographic Remarks
3.9 Exercises
4 Randomized Prediction
4.1 Introduction
4.2 Weighted Average Forecasters
4.3 Follow the Perturbed Leader
4.4 Internal Regret
4.5 Calibration
4.6 Generalized Regret
4.7 Calibration with checking rules
4.8 Bibliographic Remarks
4.9 Exercises
5 Efficient Forecasters for Large Classes of Experts
5.1 Introduction
5.2 Tracking the Best Expert
5.3 Tree Experts
5.4 The Shortest Path Problem
5.5 Tracking the Best of Many Actions
5.6 Bibliographic Remarks
5.7 Exercises
6 Prediction with Limited Feedback
6.1 Introduction
6.2 Label Efficient Prediction
6.3 Lower Bounds
6.4 Partial Monitoring
6.5 A General Forecaster for Partial Monitoring
6.6 Hannan Consistency and Partial Monitoring
6.7 Multi-armed Bandit Problems
6.8 An Improved Bandit Strategy
6.9 Lower Bounds for the Bandit Problem
6.10 How to Select the Best Action
6.11 Bibliographic Remarks
6.12 Exercises
7 Prediction and Playing Games
7.1 Games and Equilibria
7.2 Minimax Theorems
7.3 Repeated Two-Player Zero-Sum Games
7.4 Correlated Equilibrium and Internal Regret
7.5 Unknown Games: Game-Theoretic Bandits
7.6 Calibration and Correlated Equilibrium
7.7 Blackwell's Approachability Theorem
7.8 Potential-based Approachability
7.9 Convergence to Nash Equilibria
7.10 Convergence in Unknown Games
7.11 Playing Against Opponents That React
7.12 Bibliographic Remarks
7.13 Exercises
8 Absolute Loss
8.1 Simulatable Experts
8.2 Optimal Algorithm for Simulatable Experts
8.3 Static Experts
8.4 A Simple Example
8.5 Bounds for Classes of Static Experts
8.6 Bounds for General Classes
8.7 Bibliographic Remarks
8.8 Exercises
9 Logarithmic Loss
9.1 Sequential Probability Assignment
9.2 Mixture Forecasters
9.3 Gambling and Data Compression
9.4 The Minimax Optimal Forecaster
9.5 Examples
9.6 The Laplace Mixture
9.7 A Refined Mixture Forecaster
9.8 Lower Bounds for Most Sequences
9.9 Prediction with Side Information
9.10 A General Upper Bound
9.11 Further Examples
9.12 Bibliographic Remarks
9.13 Exercises
10 Sequential Investment
10.1 Portfolio Selection
10.2 The Minimax Wealth Ratio
10.3 Prediction and Investment
10.4 Universal Portfolios
10.5 The EG Investment Strategy
10.6 Investment with Side Information
10.7 Bibliographic Remarks
10.8 Exercises
11 Linear Pattern Recognition
11.1 Prediction with Side Information
11.2 Bregman Divergences
11.3 Potential-based Gradient Descent
11.4 The Transfer Function
11.5 Forecasters Using Bregman Projections
11.6 Time-Varying Potentials
11.7 The Elliptic Potential
11.8 A Nonlinear Forecaster
11.9 Lower Bounds
11.10 Mixture Forecasters
11.11 Bibliographic Remarks
11.12 Exercises
12 Linear Classification
12.1 The Zero–One Loss
12.2 The Hinge Loss
12.3 Maximum Margin Classifiers
12.4 Label Efficient Classifiers
12.5 Kernel-based Classifiers
12.6 Bibliographic Remarks
12.7 Exercises
13 Appendix
13.1 Inequalities from Probability Theory
13.1.1 Hoeffding's inequality
13.1.2 Bernstein's inequality
13.1.3 Hoeffding–Azuma inequality and related results
13.1.4 Khinchine's inequality
13.1.5 Slud's inequality
13.1.6 A simple limit theorem
13.1.7 Proof of Theorem 8.3
13.1.8 Rademacher averages
13.1.9 The Beta distribution
13.2 Basic Information Theory
13.3 Basics of Classification

1. Introduction

1.1 Prediction

Prediction, as we understand it in this book, is concerned with guessing the short-term evolution of certain phenomena. Examples of prediction problems are forecasting tomorrow’s temperature at a given location or guessing which asset will achieve the best performance over the next month. Despite their different nature, these tasks look similar at an abstract level: one must predict the next element of an unknown sequence given some knowledge about the past elements and possibly other available information. In this book we develop a formal theory of this general prediction problem. To properly address the diversity of potential applications without sacrificing mathematical rigor, the theory will be able to accommodate different formalizations of the entities involved in a forecasting task, such as the elements forming the sequence, the criterion used to measure the quality of a forecast, the protocol specifying how the predictor receives feedback about the sequence, and any possible side information provided to the predictor.

In the most basic version of the sequential prediction problem, the predictor – or forecaster – observes one after another the elements of a sequence [math]\displaystyle{ y_1,y_2,... }[/math] of symbols. At each time [math]\displaystyle{ t = 1, 2,... }[/math], before the [math]\displaystyle{ t }[/math]th symbol of the sequence is revealed, the forecaster guesses its value [math]\displaystyle{ y_t }[/math] on the basis of the previous [math]\displaystyle{ t-1 }[/math] observations.
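The round structure above can be sketched in code. This is a minimal illustrative loop, not from the book: the forecaster commits to a guess before each outcome is revealed, and only then sees the true symbol. The majority-of-past-bits rule used for `forecast` is a hypothetical placeholder strategy.

```python
def forecast(past):
    """Guess the next bit as the majority of past outcomes (ties -> 1)."""
    if not past:
        return 1
    return 1 if 2 * sum(past) >= len(past) else 0

def run_protocol(sequence):
    """Play the sequential prediction protocol on a fixed bit sequence.

    At each time t the prediction p_t is made from y_1, ..., y_{t-1} only;
    the true bit y_t is revealed afterward. Returns the mistake count.
    """
    past, mistakes = [], 0
    for y_t in sequence:
        p_t = forecast(past)      # prediction made before y_t is revealed
        mistakes += (p_t != y_t)  # forecaster observes the true outcome
        past.append(y_t)
    return mistakes
```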

In the classical statistical theory of sequential prediction, the sequence of elements, which we call outcomes, is assumed to be a realization of a stationary stochastic process. Under this hypothesis, statistical properties of the process may be estimated on the basis of the sequence of past observations, and effective prediction rules can be derived from these estimates. In such a setup, the risk of a prediction rule may be defined as the expected value of some loss function measuring the discrepancy between predicted value and true outcome, and different rules are compared based on the behavior of their risk. This book looks at prediction from a quite different angle. We abandon the basic assumption that the outcomes are generated by an underlying stochastic process and view the sequence [math]\displaystyle{ y_1, y_2, \ldots }[/math] as the product of some unknown and unspecified mechanism (which could be deterministic, stochastic, or even adversarially adaptive to our own behavior). To contrast it with stochastic modeling, this approach has often been referred to as prediction of individual sequences.

Without a probabilistic model, the notion of risk cannot be defined, and it is not immediately obvious how the goals of prediction should be set up formally. Indeed, several possibilities exist, many of which are discussed in this book. In our basic model, the performance of the forecaster is measured by the loss accumulated during many rounds of prediction, where loss is scored by some fixed loss function. Since we want to avoid any assumption on the way the sequence to be predicted is generated, there is no obvious baseline against which to measure the forecaster's performance. To provide such a baseline, we introduce a class of reference forecasters, also called experts. These experts make their predictions available to the forecaster before the next outcome is revealed. The forecaster can then make his own prediction depend on the experts' “advice” in order to keep his cumulative loss close to that of the best reference forecaster in the class. The difference between the forecaster's accumulated loss and that of an expert is called regret, as it measures how much the forecaster regrets, in hindsight, not having followed the advice of this particular expert. Regret is a basic notion of this book, and a lot of attention is paid to constructing forecasting strategies that guarantee a small regret with respect to all experts in the class. As it turns out, the possibility of keeping the regret small depends largely on the size and structure of the class of experts, and on the loss function. This model of prediction using expert advice is defined formally in Chapter 2 and serves as a basis for a large part of the book.
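The regret just defined is easy to compute in hindsight for a finished prediction game. The sketch below is illustrative, assuming the zero-one loss on binary predictions; the function name and data layout are hypothetical choices, not the book's notation.

```python
def cumulative_regret(forecaster_preds, expert_preds, outcomes):
    """Forecaster's cumulative loss minus that of the best expert.

    Uses zero-one loss: each wrong prediction costs 1. `expert_preds`
    holds one prediction sequence per expert.
    """
    forecaster_loss = sum(p != y for p, y in zip(forecaster_preds, outcomes))
    expert_losses = [
        sum(f != y for f, y in zip(expert, outcomes))
        for expert in expert_preds
    ]
    # Regret is measured against the best expert in hindsight.
    return forecaster_loss - min(expert_losses)
```

A small regret means the forecaster's total loss is nearly as low as that of the best expert chosen after the fact.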

The abstract notion of an “expert” can be interpreted in different ways, also depending on the specific application that is being considered. In some cases it is possible to view an expert as a black box of unknown computational power, possibly with access to private sources of side information. In other applications, the class of experts is collectively regarded as a statistical model, where each expert in the class represents an optimal forecaster for some given “state of nature.” With respect to this last interpretation, the goal of minimizing regret on arbitrary sequences may be thought of as a robustness requirement. Indeed, a small regret guarantees that, even when the model does not describe perfectly the state of nature, the forecaster does almost as well as the best element in the model fitted to the particular sequence of outcomes. In Chapters 2 and 3 we explore the basic possibilities and limitations of forecasters in this framework.

Models of prediction of individual sequences arose in disparate areas motivated by problems as different as playing repeated games, compressing data, or gambling. Because of this diversity, it is not easy to trace back the first appearance of such a study. But it is now recognized that Blackwell, Hannan, Robbins, and the others who, as early as the 1950s, studied the so-called sequential compound decision problem were the pioneering contributors in the field. Indeed, many of the basic ideas appear in these early works, including the use of randomization as a powerful tool for achieving a small regret when it would otherwise be impossible. The model of randomized prediction is introduced in Chapter 4. In Chapter 6 several variants of the basic problem of randomized prediction are considered in which the information available to the forecaster is limited in some way.

Another area in which prediction of individual sequences appeared naturally and found numerous applications is information theory. The influential work of Cover, Davisson, Lempel, Rissanen, Shtarkov, Ziv, and others gave the information-theoretic foundations of sequential prediction, first motivated by applications for data compression and “universal” coding, and later extended to models of sequential gambling and investment. This theory mostly concentrates on a particular loss function, the so-called logarithmic or self-information loss, as it has a natural interpretation in the framework of sequential probability assignment. In this version of the prediction problem, studied in Chapters 9 and 10, at each time instance the forecaster determines a probability distribution over the set of possible outcomes. The total likelihood assigned to the entire sequence of outcomes is then used to score the forecaster. Sequential probability assignment has been studied in different closely related models in statistics, including Bayesian frameworks and the problem of calibration in various forms. Dawid's “prequential” statistics is also close in spirit to some of the problems discussed here.

In computer science, algorithms that receive their input sequentially are said to operate in an online modality. Typical application areas of online algorithms include tasks that involve sequences of decisions, like when one chooses how to serve each incoming request in a stream. The similarity between decision problems and prediction problems, and the fact that online algorithms are typically analyzed on arbitrary sequences of inputs, have resulted in a fruitful exchange of ideas and techniques between the two fields. However, some crucial features of sequential decision problems that are missing in the prediction framework (like the presence of states to model the interaction between the decision maker and the mechanism generating the stream of requests) have so far prevented the derivation of a general theory allowing a unified analysis of both types of problems.

1.2 Learning

Prediction of individual sequences has also been a main topic of research in the [[theory of machine learning]], more concretely in the area of online learning. In fact, in the late 1980s–early 1990s the paradigm of prediction with expert advice was first introduced as a model of online learning in the pioneering papers of De Santis, Markowski, and Wegman; Littlestone and Warmuth; and Vovk, and it has been intensively investigated ever since. An interesting extension of the model allows the forecaster to consider other information apart from the past outcomes of the sequence to be predicted. By considering side information taking values in a vector space, and experts that are linear functions of the side information vector, one obtains classical models of online pattern recognition. For example, Rosenblatt’s Perceptron algorithm, the Widrow-Hoff rule, and ridge regression can be naturally cast in this framework. Chapters 11 and 12 are devoted to the study of such online learning algorithms.
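The mention of linear experts over side-information vectors can be made concrete with a minimal online Perceptron sketch. This is an illustrative rendering of Rosenblatt's mistake-driven update, not the book's presentation; the data layout (`stream` of `(x, y)` pairs with labels in {-1, +1}) is a hypothetical choice.

```python
def perceptron(stream):
    """Online Perceptron on a stream of (x, y) pairs with y in {-1, +1}.

    The linear predictor w is updated only on rounds where it errs
    (or has zero margin). Returns the final weights and mistake count.
    """
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)            # lazily initialize to zero vector
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if y * margin <= 0:               # mistake: sign(margin) != y
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]  # additive update
    return w, mistakes
```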

Researchers in machine learning and information theory have also been interested in the computational aspects of prediction. This becomes a particularly important problem when very large classes of reference forecasters are considered, and various tricks need to be invented to make predictors feasible for practical applications. Chapter 5 gathers some of these basic tricks illustrated on a few prototypical examples.

1.3 Games

The online prediction model studied in this book has an intimate connection with game theory. First of all, the model is most naturally defined in terms of a repeated game played between the forecaster and the “environment” generating the outcome sequence, thus offering a convenient way of describing variants of the basic theme. However, the connection is much deeper. For example, in Chapter 7 we show that classical minimax theorems of game theory can be recovered as simple applications of some basic bounds for the performance of sequential prediction algorithms. On the other hand, certain generalized minimax theorems, most notably Blackwell's approachability theorem, can be used to define forecasters with good performance on individual sequences.

Perhaps surprisingly, the connection goes even deeper. It turns out that if all players in a repeated normal form game play according to certain simple regret-minimizing prediction strategies, then the induced dynamics lead to equilibrium in a certain sense. This interesting line of research has been gaining ground in game theory, based on the pioneering work of Foster, Vohra, Hart, Mas-Colell, and others. In Chapter 7 we discuss the possibilities and limitations of strategies based on regret-minimizing forecasting algorithms that lead to various notions of equilibria.

1.4 A Gentle Start

To introduce the reader to the spirit of the results contained in this book, we now describe in detail a simple example of a forecasting procedure and then analyze its performance on an arbitrary sequence of outcomes.

Consider the problem of predicting an unknown sequence [math]\displaystyle{ y_1, y_2, \ldots }[/math] of bits [math]\displaystyle{ y_t \in \{0,1\} }[/math]. At each time [math]\displaystyle{ t }[/math] the forecaster first makes his guess [math]\displaystyle{ p_t \in \{0,1\} }[/math] for [math]\displaystyle{ y_t }[/math]. Then the true bit [math]\displaystyle{ y_t }[/math] is revealed and the forecaster finds out whether his prediction was correct. To compute [math]\displaystyle{ p_t }[/math] the forecaster listens to the advice of [math]\displaystyle{ N }[/math] experts. This advice takes the form of a binary vector [math]\displaystyle{ (f_{1,t}, \ldots, f_{N,t}) }[/math], where [math]\displaystyle{ f_{i,t} \in \{0,1\} }[/math] is the prediction that expert [math]\displaystyle{ i }[/math] makes for the next bit [math]\displaystyle{ y_t }[/math]. Our goal is to bound the number of time steps [math]\displaystyle{ t }[/math] in which [math]\displaystyle{ p_t \neq y_t }[/math], that is, to bound the number of mistakes made by the forecaster.

To start with an even simpler case, assume we are told in advance that, on this particular sequence of outcomes, there is some expert [math]\displaystyle{ i }[/math] that makes no mistakes. That is, we know that [math]\displaystyle{ f_{i,t} = y_t }[/math] for some [math]\displaystyle{ i }[/math] and for all [math]\displaystyle{ t }[/math], but we do not know for which [math]\displaystyle{ i }[/math] this holds. Using this information, it is not hard to devise a forecasting strategy that makes at most [math]\displaystyle{ \log_2 N }[/math] mistakes on the sequence. To see this, consider the forecaster that starts by assigning a weight [math]\displaystyle{ w_j = 1 }[/math] to each expert [math]\displaystyle{ j = 1, \ldots, N }[/math]. At every time step [math]\displaystyle{ t }[/math], the forecaster predicts with [math]\displaystyle{ p_t = 1 }[/math] if and only if the number of experts [math]\displaystyle{ j }[/math] with [math]\displaystyle{ w_j = 1 }[/math] such that [math]\displaystyle{ f_{j,t} = 1 }[/math] is bigger than the number of those with [math]\displaystyle{ w_j = 1 }[/math] such that [math]\displaystyle{ f_{j,t} = 0 }[/math]. After [math]\displaystyle{ y_t }[/math] is revealed, if [math]\displaystyle{ p_t \neq y_t }[/math], then the forecaster performs the assignment [math]\displaystyle{ w_k \leftarrow 0 }[/math] on the weight of every expert [math]\displaystyle{ k }[/math] such that [math]\displaystyle{ f_{k,t} \neq y_t }[/math]. In words, this forecaster keeps track of which experts make a mistake and predicts according to the majority of the experts that have been always correct.

The analysis is immediate. Let [math]\displaystyle{ W_m }[/math] be the sum of the weights of all experts after the forecaster has made [math]\displaystyle{ m }[/math] mistakes. Initially, [math]\displaystyle{ m = 0 }[/math] and [math]\displaystyle{ W_0 = N }[/math]. When the forecaster makes his [math]\displaystyle{ m }[/math]th mistake, at least half of the experts that have been always correct so far make their first mistake. This implies that [math]\displaystyle{ W_m \le W_{m-1}/2 }[/math], since those experts that were incorrect for the first time have their weight zeroed by the forecaster. Since the above inequality holds for all [math]\displaystyle{ m \ge 1 }[/math], we have [math]\displaystyle{ W_m \le W_0/2^m }[/math]. Recalling that expert [math]\displaystyle{ i }[/math] never makes a mistake, we know that [math]\displaystyle{ w_i = 1 }[/math], which implies that [math]\displaystyle{ W_m \ge 1 }[/math]. Using this together with [math]\displaystyle{ W_0 = N }[/math], we thus find that [math]\displaystyle{ 1 \le N/2^m }[/math]. Solving for [math]\displaystyle{ m }[/math] (which must be an integer) gives the claimed inequality [math]\displaystyle{ m \le \lfloor \log_2 N \rfloor }[/math].

We now move on to analyze the general case, in which the forecaster does not have any preliminary information on the number of mistakes the experts will make on the sequence. Our goal now is to relate the number of mistakes made by the forecaster to the number of mistakes made by the best expert, irrespective of which sequence is being predicted. Looking back at the previous forecasting strategy, it is clear that setting the weight of an incorrect expert to zero makes sense only if we are sure that some expert will never …
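The halving forecaster described above can be sketched directly. This is an illustrative implementation assuming, as in the text, that some expert is perfect; the function name and input layout are hypothetical. It predicts with the majority of the experts that have never erred and zeroes the weight of every expert caught in a mistake, so it makes at most floor(log2(N)) mistakes.

```python
def halving(expert_advice, outcomes):
    """Halving forecaster for binary prediction with one perfect expert.

    expert_advice[t][i] is expert i's guess for the bit outcomes[t].
    Returns the number of mistakes made by the forecaster.
    """
    n = len(expert_advice[0])
    w = [1] * n                                # weight 1 = never wrong yet
    mistakes = 0
    for advice, y in zip(expert_advice, outcomes):
        ones = sum(w[i] for i in range(n) if advice[i] == 1)
        zeros = sum(w[i] for i in range(n) if advice[i] == 0)
        p = 1 if ones > zeros else 0           # majority vote of survivors
        if p != y:
            mistakes += 1
        for i in range(n):                     # zero out incorrect experts
            if advice[i] != y:
                w[i] = 0
    return mistakes
```

On each of the forecaster's mistakes, at least half the surviving weight is removed, which is exactly the halving argument in the analysis above.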

References

* Nicolò Cesa-Bianchi, and Gábor Lugosi. (2006). “Prediction, Learning, and Games.” Cambridge University Press.