# Fisher's Linear Discriminant Analysis Algorithm

A Fisher's Linear Discriminant Analysis Algorithm is a linear discriminant analysis algorithm that does not assume normally distributed classes or equal class covariances.

**Context:**
- It can be applied by a Fisher's Linear Discriminant Analysis System.

**See:** Linear Classification Algorithm.

## References

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/linear_discriminant_analysis#Fisher Retrieved: 2015-7-8.
- The terms *Fisher's linear discriminant* and *LDA* are often used interchangeably, although Fisher's original article^{[1]} actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances.

Suppose two classes of observations have means [math]\displaystyle{ \vec \mu_0, \vec \mu_1 }[/math] and covariances [math]\displaystyle{ \Sigma_0, \Sigma_1 }[/math]. Then the linear combination of features [math]\displaystyle{ \vec w \cdot \vec x }[/math] will have means [math]\displaystyle{ \vec w \cdot \vec \mu_i }[/math] and variances [math]\displaystyle{ \vec w^T \Sigma_i \vec w }[/math] for [math]\displaystyle{ i=0,1 }[/math]. Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:

[math]\displaystyle{ S=\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}= \frac{(\vec w \cdot \vec \mu_1 - \vec w \cdot \vec \mu_0)^2}{\vec w^T \Sigma_1 \vec w + \vec w^T \Sigma_0 \vec w} = \frac{(\vec w \cdot (\vec \mu_1 - \vec \mu_0))^2}{\vec w^T (\Sigma_0+\Sigma_1) \vec w} }[/math]

This measure is, in some sense, a measure of the signal-to-noise ratio for the class labelling. It can be shown that the maximum separation occurs when

[math]\displaystyle{ \vec w \propto (\Sigma_0+\Sigma_1)^{-1}(\vec \mu_1 - \vec \mu_0) }[/math]

When the assumptions of LDA are satisfied, the above equation is equivalent to LDA.
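As a minimal sketch of this result, the direction maximizing Fisher's criterion can be computed directly from the class means and covariances; the sample data and variable names below are illustrative, not from any particular library:

```python
import numpy as np

# Two synthetic classes with unequal covariances (Fisher's criterion
# does not require equal class covariances).
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]], size=200)
X1 = rng.multivariate_normal([2.0, 1.5], [[0.7, -0.2], [-0.2, 1.2]], size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

# Maximizing Fisher's ratio S(w) gives w ∝ (Σ0 + Σ1)^{-1} (μ1 − μ0).
w = np.linalg.solve(S0 + S1, mu1 - mu0)

# Fisher's separation for this w: between-class over within-class variance.
S = (w @ (mu1 - mu0)) ** 2 / (w @ (S0 + S1) @ w)
```

Since the criterion is a ratio, only the direction of [math]\displaystyle{ \vec w }[/math] matters; any nonzero rescaling of `w` leaves `S` unchanged.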

Be sure to note that the vector [math]\displaystyle{ \vec w }[/math] is the normal to the discriminant hyperplane. As an example, in a two dimensional problem, the line that best divides the two groups is perpendicular to [math]\displaystyle{ \vec w }[/math] .

Generally, the data points to be discriminated are projected onto [math]\displaystyle{ \vec w }[/math] ; then the threshold that best separates the data is chosen from analysis of the one-dimensional distribution. There is no general rule for the threshold. However, if projections of points from both classes exhibit approximately the same distributions, a good choice would be the hyperplane between projections of the two means, [math]\displaystyle{ \vec w \cdot \vec \mu_0 }[/math] and [math]\displaystyle{ \vec w \cdot \vec \mu_1 }[/math] . In this case the parameter c in threshold condition [math]\displaystyle{ \vec w \cdot \vec x \gt c }[/math] can be found explicitly: : [math]\displaystyle{ c = \vec w \cdot \frac12 (\vec \mu_0 + \vec \mu_1) = \frac{1}{2} \vec\mu_1^t \Sigma^{-1} \vec\mu_1 - \frac{1}{2} \vec\mu_0^t \Sigma^{-1} \vec\mu_0 }[/math] .

Otsu's Method is related to Fisher's linear discriminant. It was created to binarize the histogram of pixels in a grayscale image by optimally picking the black/white threshold: the one that minimizes the intra-class variance and maximizes the inter-class variance between the grayscale values assigned to the black and white pixel classes.
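To illustrate the connection, here is a small NumPy sketch of Otsu-style threshold selection over a 256-bin histogram; the function name and histogram handling are assumptions for this example, not from a particular library:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the cut maximizing between-class variance of a 0-255 histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # class-0 (black) probability per cut
    mu = np.cumsum(p * np.arange(256))    # cumulative first moment
    mu_t = mu[-1]                         # overall mean intensity
    # Between-class variance: σ_b²(t) = (μ_T ω(t) − μ(t))² / (ω(t)(1 − ω(t)))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b2[~np.isfinite(sigma_b2)] = 0.0
    return int(np.argmax(sigma_b2))
```

Maximizing the between-class variance here is equivalent to minimizing the intra-class variance, since their sum is the fixed total variance of the histogram.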


- ↑ Fisher, R. A. (1936). “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics, 7(2).

### 2010

- (Durrant et al., 2010) ⇒ Robert J. Durrant, and Ata Kaban. (2010). “Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data.” In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2010). doi:10.1145/1835804.1835945
- QUOTE: FLD is a generative classifier that seeks to model, given training data TN, the optimal decision boundary between classes. It is a successful and widely used classification method. The classical version is formulated for 2-class problems, as follows. If [math]\displaystyle{ \pi_0, \Sigma = \Sigma_0 = \Sigma_1 }[/math] and [math]\displaystyle{ \mu_0 }[/math] and [math]\displaystyle{ \mu_1 }[/math] are known, then the optimal classifier is given by Bayes’ rule (4):

[math]\displaystyle{ h(X_q) = 1 \Big\{ \log\frac{(1-\pi_0)f_1(X_q)}{\pi_0\;f_0(X_q)} \gt 0\Big\}=1 \Big\{ \log\frac{1-\pi_0}{\pi_0}+(\mu_1-\mu_0)^T\Sigma^{-1}\Big(X_q-\frac{\mu_0+\mu_1}{2}\Big) \gt 0\Big\} }[/math]

- where [math]\displaystyle{ 1(A) }[/math] is the indicator function that returns one if A is true and zero otherwise, and [math]\displaystyle{ f_y }[/math] is the probability density function of the y-th data-class, in its simplest form the multivariate Gaussian [math]\displaystyle{ N(\mu_y, \Sigma) }[/math], namely:

[math]\displaystyle{ \bigg((2\pi)^{d/2}\;(\det\Sigma)^{1/2}\bigg)^{-1} \exp\bigg(-\frac{1}{2}(X-\mu_y)^{T}\Sigma^{-1}(X-\mu_y)\bigg) }[/math]
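A small sketch of this 2-class decision rule, assuming known means, a shared covariance, and the sign convention that class 1 receives the larger score; the function and variable names are illustrative, not from the paper's code:

```python
import numpy as np

def fld_predict(Xq, mu0, mu1, Sigma, pi0=0.5):
    """Return 1 where log((1-π0) f1(x) / (π0 f0(x))) > 0, else 0.

    With shared Gaussian densities f0, f1, the log-ratio reduces to
    log((1-π0)/π0) + (μ1-μ0)^T Σ^{-1} (x - (μ0+μ1)/2).
    """
    Si = np.linalg.inv(Sigma)
    score = np.log((1 - pi0) / pi0) + (Xq - (mu0 + mu1) / 2) @ Si @ (mu1 - mu0)
    return (score > 0).astype(int)
```

Note that the Gaussian density itself cancels out of the rule except through this linear score, which is why the resulting decision boundary is a hyperplane.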