# Perceptron Training Algorithm

A Perceptron Training Algorithm is a supervised, linear model-based binary classification algorithm used to train a perceptron network.

**Context:**
- It can be implemented by a Perceptron training system (that can solve a perceptron training task to produce a Perceptron-based classifier).
- It can identify a Hyperplane that separates a Linearly Separable set of Training Vectors.
- It can range from being a Perceptron Classification Algorithm to being a Perceptron Regression Algorithm.
- It can have a Predictive Classifier of the form *h(x)* = sign(*f(x)*), where *f()* is the Hyperplane Decision Boundary.
**Example(s):**

**Counter-Example(s):**

**See:** Dual Optimization Task, Neural Network Training Algorithm, Linear Model-based Classification Algorithm, Linear Predictor Function.
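The predictive classifier *h(x)* = sign(*f(x)*) above can be sketched in a few lines of Python. This is a minimal illustration, not from the source: the function name `predict` and the example hyperplane *f(x)* = x₁ + x₂ − 1 are assumptions chosen for demonstration.

```python
# Minimal sketch of the linear predictive classifier h(x) = sign(f(x)),
# where f(x) = w·x + b is a hypothetical hyperplane decision boundary.
def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies on."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1

# Example hyperplane x1 + x2 - 1 = 0: the point (2, 0.5) lies on the positive side.
print(predict([1.0, 1.0], -1.0, [2.0, 0.5]))  # → 1
```

A training algorithm's job is to find the weights `w` and bias `b` that make this classifier separate the training vectors.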

## References

### 2017a

- (Auer, 2017) ⇒ Peter Auer. (2017). “Online Learning.” In: "Encyclopedia of Machine Learning and Data Mining," pp. 929-937.
- QUOTE: In this section we consider an example for an online learning algorithm that competes with a continuous set of experts, in contrast to the finite sets of experts we have considered so far. This algorithm — the perceptron algorithm (Rosenblatt 1958) — was among the first online learning algorithms developed. Another of these early online learning algorithms with a continuous set of experts is the Winnow algorithm by Littlestone (1988). A unified analysis of these algorithms can be found in Cesa-Bianchi and Lugosi (2006). This analysis covers a large class of algorithms, in particular the p-norm perceptrons, which smoothly interpolate between the perceptron algorithm and Winnow.
The perceptron algorithm aims at learning a linear classification function. Thus inputs are from a Euclidean space, [math]\displaystyle{ X = \mathbb{R}^{d} }[/math], the predictions and responses are binary, [math]\displaystyle{ Y = Z = \{0, 1\} }[/math], and the discrete misclassification loss is used. Each expert is a linear classifier, represented by its weight vector [math]\displaystyle{ v \in \mathbb{R}^{d} }[/math], whose linear classification is given by [math]\displaystyle{ \Phi_v : X \rightarrow \{0, 1\} }[/math], [math]\displaystyle{ \Phi_v (x) = 1 }[/math] if [math]\displaystyle{ v \cdot x \ge 0 }[/math] and [math]\displaystyle{ \Phi_{v}(x) = 0 }[/math] if [math]\displaystyle{ v \cdot x \lt 0 }[/math].

The perceptron algorithm maintains a weight vector [math]\displaystyle{ w_{t} \in \mathbb{R}^{d} }[/math] that is initialized as [math]\displaystyle{ w_1 = (0,\cdots, 0) }[/math]. After receiving input [math]\displaystyle{ x_t }[/math], the perceptron’s prediction is calculated using this weight,

[math]\displaystyle{ y_{t} = \Phi _{w_{t}}(x_{t}) }[/math],

and the weight vector is updated,

[math]\displaystyle{ w_{t+1} = w_{t} +\eta (z_{t} - y_{t})x_{t}, }[/math]

where [math]\displaystyle{ \eta \gt 0 }[/math] is a learning rate parameter. Thus, if the prediction is correct, [math]\displaystyle{ y_t = z_t }[/math], then the weights are not changed. Otherwise, the product [math]\displaystyle{ w_{t+1} \cdot x_t }[/math] is moved in the correct direction: since [math]\displaystyle{ w_{t+1} \cdot x_t = w_t \cdot x_t + \eta(z_t - y_t) \parallel x_t \parallel^2 }[/math], we get [math]\displaystyle{ w_{t+1} \cdot x_t \gt w_t \cdot x_t }[/math] if [math]\displaystyle{ y_t = 0 }[/math] but [math]\displaystyle{ z_t = 1 }[/math], and [math]\displaystyle{ w_{t+1} \cdot x_t \lt w_t \cdot x_t }[/math] if [math]\displaystyle{ y_t = 1 }[/math] but [math]\displaystyle{ z_t = 0 }[/math].

We may assume that the inputs are normalized, [math]\displaystyle{ \parallel x_t \parallel = 1 }[/math], otherwise a normalized [math]\displaystyle{ x_t }[/math] can be used in the update of the weight vector. Furthermore, we note that the learning rate [math]\displaystyle{ \eta }[/math] is irrelevant for the performance of the perceptron algorithm, since it scales only the size of the weights but does not change the predictions. Nevertheless, we keep the learning rate since it will simplify the analysis.
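Auer's online update can be sketched directly in Python. This is an illustrative sketch, not code from the source: the function names `phi` and `perceptron_online` and the mistake counter are assumptions, while the prediction rule and the update [math]\displaystyle{ w_{t+1} = w_t + \eta(z_t - y_t)x_t }[/math] follow the quoted description (binary labels in {0, 1}).

```python
# Sketch of the online perceptron described above: predictions and responses
# are in {0, 1}, and the weights change only on a mistake, since (z - y) = 0
# whenever the prediction is correct.
def phi(w, x):
    """Linear classifier Phi_w: 1 if w·x >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def perceptron_online(stream, d, eta=1.0):
    """Process (x_t, z_t) pairs one at a time; return final weights and mistake count."""
    w = [0.0] * d                 # w_1 = (0, ..., 0)
    mistakes = 0
    for x, z in stream:
        y = phi(w, x)             # predict before seeing the response z
        if y != z:
            mistakes += 1
        # Update w_{t+1} = w_t + eta*(z - y)*x; a no-op when y == z.
        w = [wi + eta * (z - y) * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

As the quote notes, the value of `eta` only rescales the weights and does not change any prediction, so it is kept purely for the analysis.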


### 2017b

- (Digio, 2017) ⇒ Digio. (2017). “Differences between logistic regression and perceptrons." URL (version: 2017-06-07): https://stats.stackexchange.com/q/284013
- QUOTE: … Logistic regression models a function of the mean of a Bernoulli distribution as a linear equation (the mean being equal to the probability p of a Bernoulli event). By using the logit link as a function of the mean (p), the logarithm of the odds (log-odds) can be derived analytically and used as the response of a so-called generalised linear model. Parameter estimation on this GLM is then a statistical process which yields p-values and confidence intervals for model parameters. On top of prediction, this allows you to interpret the model in causal inference. This is something that you cannot achieve with a linear Perceptron.
The Perceptron is a reverse engineering process of logistic regression: Instead of taking the logit of y, it takes the inverse logit (logistic) function of wx, and doesn't use probabilistic assumptions for either the model or its parameter estimation. Online training will give you exactly the same estimates for the model weights/parameters, but you won't be able to interpret them in causal inference due to the lack of p-values, confidence intervals, and, well, an underlying probability model.

Long story short, logistic regression is a GLM which can perform prediction and inference, whereas the linear Perceptron can only achieve prediction (in which case it will perform the same as logistic regression). The difference between the two is also the fundamental difference between statistical modelling and machine learning.


### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/perceptron Retrieved:2015-5-13.
- In machine learning, the **perceptron** is an algorithm for supervised learning of binary classifiers: functions that can decide whether an input (represented by a vector of numbers) belongs to one class or another. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced.


### 2011

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Perceptron#Learning_algorithm
- Below is an example of a learning algorithm for a single-layer (no hidden layer) perceptron. For multilayer perceptrons, more complicated algorithms such as backpropagation must be used. Alternatively, methods such as the delta rule can be used if the function is non-linear and differentiable, although the one below will work as well.
The learning algorithm we demonstrate is the same across all the output neurons, therefore everything that follows is applied to a single neuron in isolation. We first define some variables:

- [math]\displaystyle{ y = f(\mathbf{z}) \, }[/math] denotes the *output* from the perceptron for an input vector [math]\displaystyle{ \mathbf{z} }[/math].
- [math]\displaystyle{ b \, }[/math] is the *bias* term, which in the example below we take to be 0.
- [math]\displaystyle{ D = \{(\mathbf{x}_1,d_1),\dots,(\mathbf{x}_s,d_s)\} \, }[/math] is the *training set* of [math]\displaystyle{ s }[/math] samples, where:
    - [math]\displaystyle{ \mathbf{x}_j }[/math] is the [math]\displaystyle{ n }[/math]-dimensional input vector.
    - [math]\displaystyle{ d_j \, }[/math] is the desired output value of the perceptron for that input.
- We show the values of the nodes as follows:
    - [math]\displaystyle{ x_{j,i} \, }[/math] is the value of the [math]\displaystyle{ i }[/math]th node of the [math]\displaystyle{ j }[/math]th training *input vector*.
    - [math]\displaystyle{ x_{j,0} = 1 \, }[/math].
- To represent the weights:
    - [math]\displaystyle{ w_i \, }[/math] is the [math]\displaystyle{ i }[/math]th value in the *weight vector*, to be multiplied by the value of the [math]\displaystyle{ i }[/math]th input node.
- An extra dimension, with index [math]\displaystyle{ n+1 }[/math], can be added to all input vectors, with [math]\displaystyle{ x_{j,n+1}=1 \, }[/math], in which case [math]\displaystyle{ w_{n+1} \, }[/math] replaces the bias term.
To show the time-dependence of [math]\displaystyle{ \mathbf{w} }[/math], we use:

- [math]\displaystyle{ w_i(t) \, }[/math] is the weight [math]\displaystyle{ i }[/math] at time [math]\displaystyle{ t }[/math].
- [math]\displaystyle{ \alpha \, }[/math] is the *learning rate*, where [math]\displaystyle{ 0 \lt \alpha \leq 1 }[/math].

- Too high a learning rate makes the perceptron periodically oscillate around the solution unless additional steps are taken.
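The variables defined above translate into a short training loop. The sketch below is illustrative, not from the source: the function name `train_perceptron`, the threshold activation, the epoch cap, and the early stop on an error-free pass are assumptions; the training set D, desired outputs d_j, learning rate α, and the bias folded in as [math]\displaystyle{ x_{j,0} = 1 }[/math] come from the definitions above.

```python
# Sketch of single-layer perceptron training with the bias replaced by an
# extra weight w[0] multiplying a constant input x_{j,0} = 1.
def train_perceptron(D, n, alpha=0.5, max_epochs=100):
    """D: list of (x, d) pairs with x of length n; returns n+1 weights."""
    w = [0.0] * (n + 1)                      # w[0] plays the role of the bias b
    for _ in range(max_epochs):
        errors = 0
        for x, d in D:
            xa = [1.0] + list(x)             # prepend the constant node x_{j,0} = 1
            y = 1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else 0
            if y != d:                       # weights change only on a mistake
                errors += 1
                w = [wi + alpha * (d - y) * xi for wi, xi in zip(w, xa)]
        if errors == 0:                      # error-free pass: converged
            break
    return w

# Usage: learn the (linearly separable) logical AND function.
D = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(D, 2)
```

On linearly separable data such as AND, the loop reaches an error-free pass; on nonseparable data it would run until `max_epochs`, which is why variants like the pocket algorithm (Gallant, 1990, below) are used in that case.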


### 2009

- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Perceptron#Learning_algorithm
- The learning algorithm is the same across all neurons, therefore everything that follows is applied to a single neuron in isolation. … Learning is modeled as the weight vector being updated for multiple iterations over all training examples. ...

Sample: (x_i, t_i), t_i in {-1, +1}
repeat (over all training examples)
    if t_i <w_k, x_i> < 0 then   /* Error */
        w_{k+1} = w_k + t_i x_i
        k = k + 1
until (error == false)
return k, (w_k, b_k)   /* where k is the number of mistakes */
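The pseudocode above uses the ±1-label form of the perceptron, where a mistake is detected by the sign of [math]\displaystyle{ t_i \langle w_k, x_i \rangle }[/math]. A Python sketch, with assumptions: the function name `perceptron_mistakes` and the pass cap are mine; the mistake test is written as `<= 0` rather than the strict `< 0` of the pseudocode, a common convention that lets the all-zero initial weight vector trigger its first update; the bias b is folded into the weight vector.

```python
# Sketch of the mistake-counting perceptron: labels t_i in {-1, +1},
# update w_{k+1} = w_k + t_i * x_i on each mistake, k counts mistakes.
def perceptron_mistakes(samples, d, max_passes=100):
    """Cycle over (x_i, t_i) pairs until an error-free pass; return (k, w)."""
    w = [0.0] * d
    k = 0
    for _ in range(max_passes):
        error = False
        for x, t in samples:
            if t * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
                w = [wi + t * xi for wi, xi in zip(w, x)]       # move toward t*x
                k += 1
                error = True
        if not error:        # until (error == false)
            break
    return k, w
```

For linearly separable data the perceptron convergence theorem bounds k, so the loop terminates with a separating weight vector.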

### 2007

- (Surdeanu and Ciaramita, 2007) ⇒ Mihai Surdeanu, and Massimiliano Ciaramita. (2007). “Robust Information Extraction with Perceptrons.” In: Proceedings of NIST 2007 Automatic Content Extraction Workshop.
- ABSTRACT: We present a system for the extraction of entity and relation mentions. Our work focused on robustness and simplicity: all system components are modeled using variants of the Perceptron algorithm (Rosenblatt, 1958) and only partial syntactic information is used for feature extraction. Our approach has two novel ideas. First, we define a new large-margin Perceptron algorithm tailored for class-unbalanced data which dynamically adjusts its margins, according to the generalization performance of the model. Second, we propose a novel architecture that lets classification ambiguities flow through the system and solves them only at the end. The system achieves competitive accuracy on the ACE English EMD and RMD tasks.

### 1990

- (Gallant, 1990) ⇒ S. I. Gallant. (1990). "Perceptron-based Learning Algorithms". In: IEEE Transactions on Neural Networks, 1(2). DOI: 10.1109/72.80230
- ABSTRACT: A key task for connectionist research is the development and analysis of learning algorithms. An examination is made of several supervised learning algorithms for single-cell and network models. The heart of these algorithms is the pocket algorithm, a modification of perceptron learning that makes perceptron learning well-behaved with nonseparable training data, even if the data are noisy and contradictory. Features of these algorithms include: speed, i.e. algorithms fast enough to handle large sets of training data; network scaling properties, i.e. network methods scale up almost as well as single-cell models when the number of inputs is increased; analytic tractability, i.e. upper bounds on classification error are derivable; online learning, i.e. some variants can learn continually, without referring to previous data; and winner-take-all groups or choice groups, i.e. algorithms can be adapted to select one out of a number of possible classifications. These learning algorithms are suitable for applications in machine learning, pattern recognition, and connectionist expert systems.
