2005 Smooth949InsensitiveRegressionb


Subject Headings: Regressor, Regression Algorithm, Linear Regressor.

Notes

Cited By

Quotes

Author Keywords

Abstract

We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
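To make the symmetrization idea concrete, one plausible explicit form (our illustration of the construction described above, not a formula quoted from the paper) applies the logistic loss to the discrepancy [math]\displaystyle{ \delta = f(\bf{x}) - y }[/math] shifted by an insensitivity parameter [math]\displaystyle{ \epsilon }[/math] in both directions:

[math]\displaystyle{ L_{\log}(\delta; \epsilon) = \log\left(1 + e^{\delta - \epsilon}\right) + \log\left(1 + e^{-\delta - \epsilon}\right). }[/math]

Such a loss is smooth everywhere, close to zero when [math]\displaystyle{ |\delta| \ll \epsilon }[/math], and grows roughly like [math]\displaystyle{ |\delta| - \epsilon }[/math] when [math]\displaystyle{ |\delta| \gg \epsilon }[/math], which is how a symmetric logistic loss can serve as a smooth approximation to the ε-insensitive hinge loss.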

1. Introduction

The focus of this paper is supervised learning of real-valued functions. We observe a sequence [math]\displaystyle{ S = \{(x_1,y_1),\ldots,(x_m,y_m)\} }[/math] of instance-target pairs, where the instances are vectors in [math]\displaystyle{ \mathbb{R}^n }[/math] and the targets are real-valued scalars, [math]\displaystyle{ y_i \in \mathbb{R} }[/math]. Our goal is to learn a function [math]\displaystyle{ f : \mathbb{R}^n \rightarrow \mathbb{R} }[/math] which provides a good approximation of the target values from their corresponding instance vectors. Such a function is often referred to as a regression function or a regressor for short. Regression problems have long been the focus of research papers in statistics and learning theory (see for instance the book by [[Hastie, Tibshirani, and Friedman (2001)]] and the references therein). In this paper we discuss [[learning of linear regressor]]s, that is, [math]\displaystyle{ f }[/math] is of the form [math]\displaystyle{ f(\bf{x}) = \lambda \cdot \bf{x} }[/math]. This setting is also suitable for learning a linear combination of base regressors of the form [math]\displaystyle{ f(\bf{x}) = \sum^l_{j=1} \lambda_j h_j(\bf{x}) }[/math] where each base regressor [math]\displaystyle{ h_j }[/math] is a mapping from an instance domain [math]\displaystyle{ X }[/math] into [math]\displaystyle{ \mathbb{R} }[/math]. The latter form enables us to employ kernels by setting [math]\displaystyle{ h_j(\bf{x}) = K(x_j,\bf{x}) }[/math].
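As a concrete illustration of the two functional forms above, a minimal sketch in Python (the kernel choice and the function names are ours, for illustration only):

<syntaxhighlight lang="python">
import numpy as np

def linear_regressor(lam, x):
    """f(x) = lam . x : a linear regressor parameterized by the weight vector lam."""
    return np.dot(lam, x)

def rbf_base_regressor(x_j, x, gamma=1.0):
    """An example base regressor h_j(x) = K(x_j, x); here K is a Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((x_j - x) ** 2))

def combined_regressor(lam, anchors, x):
    """f(x) = sum_j lam_j * h_j(x), a linear combination of kernel base regressors."""
    return sum(l_j * rbf_base_regressor(x_j, x) for l_j, x_j in zip(lam, anchors))
</syntaxhighlight>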

The class of linear regressors is rather restricted. Furthermore, in real applications both the instances and the target values are often corrupted by noise, and a perfect mapping such that [math]\displaystyle{ f(x_i) = y_i }[/math] for all [math]\displaystyle{ (x_i, y_i) \in S }[/math] is usually unobtainable. Hence, we employ a loss function [math]\displaystyle{ L : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_+ }[/math] which determines the penalty for a discrepancy between the predicted target, [math]\displaystyle{ f(\bf{x}) }[/math], and the true (observed) target [math]\displaystyle{ y }[/math]. As we discuss shortly, the [[loss function]]s we consider in this paper depend only on the discrepancy between the [[predicted target]] and the true target, [math]\displaystyle{ \delta = f(\bf{x}) - y }[/math], hence [math]\displaystyle{ L }[/math] can be viewed as a function from [math]\displaystyle{ \mathbb{R} }[/math] into [math]\displaystyle{ \mathbb{R}_+ }[/math]. We therefore allow ourselves to overload our notation and denote [math]\displaystyle{ L(\delta) = L(f(\bf{x}),y) }[/math].

Given a loss function [math]\displaystyle{ L }[/math], the goal of a regression algorithm is to find a regressor [math]\displaystyle{ f }[/math] which attains a small total loss on the training set [math]\displaystyle{ S }[/math],

[math]\displaystyle{ \text{Loss}(\lambda,S) = \sum^m_{i=1}L(f(x_i)-y_i) = \sum^m_{i=1}L(\lambda \cdot x_i - y_i) }[/math]
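The total loss above is simply a sum of per-example penalties applied to the discrepancies [math]\displaystyle{ \delta_i = \lambda \cdot x_i - y_i }[/math]. A minimal sketch (the per-example loss is left pluggable, anticipating the losses discussed next):

<syntaxhighlight lang="python">
import numpy as np

def total_loss(lam, X, y, loss):
    """Loss(lam, S) = sum_i L(lam . x_i - y_i) over the sample S = {(x_i, y_i)}."""
    deltas = X @ lam - y        # vector of discrepancies delta_i
    return np.sum(loss(deltas))

# Two classical per-example losses discussed below:
abs_loss = np.abs               # sum_i |delta_i|
squared_loss = np.square        # sum_i delta_i^2
</syntaxhighlight>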

Denoting the discrepancy [math]\displaystyle{ \lambda \cdot x_i - y_i }[/math] by [math]\displaystyle{ \delta_i }[/math], we note that two common approaches to solving regression problems are to minimize either the sum of the absolute discrepancies over the sample ([math]\displaystyle{ \sum_i |\delta_i| }[/math]) or the sum of squared discrepancies ([math]\displaystyle{ \sum_i \delta_i^2 }[/math]). It has been argued that the squared loss is sensitive to outliers, hence robust regression algorithms often employ the absolute loss (Huber, 1981). Furthermore, it is often the case that the exact discrepancy between [math]\displaystyle{ \lambda \cdot \bf{x} }[/math] and [math]\displaystyle{ y }[/math] is unimportant so long as it falls below an insensitivity parameter [math]\displaystyle{ \epsilon }[/math]. Formally, the ε-insensitive hinge loss, denoted [math]\displaystyle{ |\delta|_\epsilon }[/math], is zero if [math]\displaystyle{ |\delta| \le \epsilon }[/math] and is [math]\displaystyle{ |\delta| - \epsilon }[/math] for [math]\displaystyle{ |\delta| \gt \epsilon }[/math] (see also the left hand side of Figure 2). The ε-insensitive hinge loss is not smooth as its derivative is discontinuous at [math]\displaystyle{ \delta = \pm\epsilon }[/math]. Several batch learning algorithms have been proposed for minimizing the ε-insensitive hinge loss (see for example Vapnik, 1998; Smola and Schölkopf, 1998). However, these algorithms are based on rather complex constrained optimization techniques since the ε-insensitive hinge loss is a non-smooth function.
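The two losses contrasted in this paragraph can be written down directly. A short sketch follows; the smooth surrogate uses the symmetrized logistic form sketched after the abstract, which is our assumption rather than a formula quoted from the paper:

<syntaxhighlight lang="python">
import numpy as np

def eps_insensitive_hinge(delta, eps):
    """|delta|_eps = max(|delta| - eps, 0): zero inside the insensitivity tube,
    linear outside, with a non-differentiable kink at delta = +/- eps."""
    return np.maximum(np.abs(delta) - eps, 0.0)

def symmetric_logistic(delta, eps):
    """A smooth surrogate in the spirit of the paper's symmetric logistic loss
    (exact form assumed here): log(1 + e^(delta - eps)) + log(1 + e^(-delta - eps))."""
    return np.logaddexp(0.0, delta - eps) + np.logaddexp(0.0, -delta - eps)
</syntaxhighlight>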

References

Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer (2005). "Smooth ε-Insensitive Regression by Loss Symmetrization."