# Naïve Bayes (NB) Classification Algorithm

A Naïve Bayes (NB) classification algorithm is a probabilistic learning algorithm that produces a linear generative classifier by making the conditional independence assumption (i.e., it assumes that features are conditionally independent given the class).

**Context:**
- It can be implemented by a Naïve Bayes Classification System (to solve a Naïve Bayes classification task).
- It can range from being a Binomial Naïve Bayes to being a Multinomial Naïve Bayes.
- It can range from being a Univariate Naïve Bayes to being a Multivariate Naïve Bayes.
- It can produce a Naive-Bayes Classification Model.
- It assumes Conditional Independence between Features.
- It can have Computational Complexity of [math]\displaystyle{ \Theta(n m) }[/math], where [math]\displaystyle{ n }[/math] is the number of observations and [math]\displaystyle{ m }[/math] is the average feature length.
- It can be robust to Noisy Data.
- It can naturally handle Missing Values.
- It can be robust to Irrelevant Features.
- Its Performance can degrade when there are correlated features (when the Conditional Independence assumption does not hold).

**Example(s):**
- [math]\displaystyle{ P(c_i, s) = P(c_i) P(s|c_i) }[/math].
- [math]\displaystyle{ \hat{c} = \operatorname{argmax}_{c_i} P(c_i) \prod_j P(s_j|c_i) }[/math], the NB decision rule, where [math]\displaystyle{ s_j }[/math] are the features of instance [math]\displaystyle{ s }[/math] (assumed conditionally independent given [math]\displaystyle{ c_i }[/math]).
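As a concrete illustration, the NB classification rule (choose the class maximizing the class prior times the product of per-feature likelihoods) can be sketched as a small categorical naive Bayes in Python. This is a minimal sketch: the helper names `train_nb`/`classify_nb` and the add-one (Laplace) smoothing are choices made here, not specified by the article.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Estimate log P(c) and log P(feature|c) from (features, label) pairs,
    with add-one (Laplace) smoothing for unseen features."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)          # class -> feature -> count
    vocab = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocab.update(features)
    n = len(examples)
    model = {}
    for c, cc in class_counts.items():
        total = sum(feature_counts[c].values()) + len(vocab)
        model[c] = (
            math.log(cc / n),                      # log prior
            {f: math.log((feature_counts[c][f] + 1) / total) for f in vocab},
            math.log(1 / total),                   # fallback for unseen features
        )
    return model

def classify_nb(model, features):
    """argmax_c [ log P(c) + sum_j log P(s_j|c) ] -- logs avoid underflow."""
    def score(c):
        log_prior, log_lik, unseen = model[c]
        return log_prior + sum(log_lik.get(f, unseen) for f in features)
    return max(model, key=score)

# toy usage: fruits described by categorical features
model = train_nb([(["red", "round"], "apple"), (["yellow", "long"], "banana")])
print(classify_nb(model, ["red", "round"]))        # -> apple
```

Working in log space is standard practice here, since the product of many small likelihoods would otherwise underflow.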

**Counter-Example(s):**
- a Discriminative Classification Algorithm, such as a Logistic Regression Algorithm (naive Bayes' discriminative counterpart).

**See:** Learning Algorithm; Generative Statistical Model; Bayes Rule; Bayesian Method; Bayesian Networks; Semi-Naïve Bayesian Learning.

## References

### 2014

- http://en.wikipedia.org/wiki/Linear_classifier#Generative_models_vs._discriminative_models
- … Methods of the first class model conditional density functions [math]\displaystyle{ P(\vec x|{\rm class}) }[/math]. Examples of such algorithms include:

### 2011

- (Webb, 2011g) ⇒ Geoffrey I. Webb. (2011). “Naïve Bayes.” In: (Sammut & Webb, 2011) p.713

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Naive_Bayes_classifier
- QUOTE: A **Naive Bayes classifier** is a simple probabilistic classifier based on applying **Bayes' theorem** with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.

In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers.^{[1]} Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests.^{[2]}

An advantage of the naive Bayes classifier is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.
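The "means and variances only" estimation described in the quote can be sketched as a Gaussian naive Bayes in plain Python. This is a minimal illustration, not the quoted source's code; the function names and the small variance floor (`1e-9`, to avoid division by zero) are this sketch's own assumptions.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances.
    No covariance matrix is needed: features are assumed conditionally
    independent given the class."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                 for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(X), means, varis)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the highest log posterior under independent
    univariate Gaussians per feature."""
    def log_posterior(c):
        prior, means, varis = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                 for xi, m, v in zip(x, means, varis))
        return math.log(prior) + ll
    return max(model, key=log_posterior)

# toy usage: two well-separated 2-D classes
X = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]]
y = [0, 0, 1, 1]
m = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(m, [1.1, 1.0]))   # -> 0
```

With d features and k classes this stores only k·(2d + 1) parameters, versus the O(k·d²) a full per-class covariance model would require, which is why so little training data suffices.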


- ↑ Harry Zhang. (2004). “The Optimality of Naive Bayes.” In: Proceedings of FLAIRS 2004.
- ↑ Rich Caruana, and Alexandru Niculescu-Mizil. (2006). “An Empirical Comparison of Supervised Learning Algorithms.” In: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006).

### 2009

- (Hand, 2009) ⇒ David J. Hand. (2009). “Naïve Bayes.” In: (Wu & Kumar, 2009).

### 2008

- (Wu, Kumar et al., 2008) ⇒ Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu and Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. (2008). “Top 10 Algorithms in Data Mining.” In: Knowledge and Information Systems, 14(1). doi:10.1007/s10115-007-0114-2
- QUOTE: One very important one is the naive Bayes method — also called idiot’s Bayes, simple Bayes, and independence Bayes. This method is important for several reasons. It is very easy to construct, not needing any complicated iterative parameter estimation schemes. This means it may be readily applied to huge data sets. It is easy to interpret, so users unskilled in classifier technology can understand why it is making the classification it makes. And finally, it often does surprisingly well: it may not be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well. General discussion of the naive Bayes method and its merits are given in (Domingos & Pazzani, 1997; Hand & Yu, 2001).

### 2004

- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
- QUOTE: In supervised classification, inputs [math]\displaystyle{ x }[/math] and their labels [math]\displaystyle{ y }[/math] arise from an unknown joint probability [math]\displaystyle{ p(x, y) }[/math]. If we can approximate [math]\displaystyle{ p(x, y) }[/math] using a parametric family of models [math]\displaystyle{ G = \{p_\theta(x, y), \theta \in \Theta\} }[/math], then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called *generative* classification.

However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]\displaystyle{ p(y \vert x) }[/math]. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution [math]\displaystyle{ p(x) }[/math]. Well-known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models, e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than discriminative ones.

Conversely, the generative approach converges to the best model for the joint distribution [math]\displaystyle{ p(x, y) }[/math], but the resulting conditional density is usually a biased classifier unless its [math]\displaystyle{ p_\theta(x) }[/math] part is an accurate model for [math]\displaystyle{ p(x) }[/math]. In real-world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density [math]\displaystyle{ p(x, y) }[/math] [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
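The generative-discriminative pairing discussed in the quote can be tried out empirically. The sketch below uses scikit-learn's `GaussianNB` and `LogisticRegression` on synthetic Gaussian classes; it assumes scikit-learn and NumPy are installed, and the data parameters (class means, seed) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two Gaussian classes with equal spherical covariance -- the setting where
# the generative model's assumptions hold exactly.
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2, 2], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

for name, clf in [("generative (GaussianNB)", GaussianNB()),
                  ("discriminative (LogisticRegression)", LogisticRegression())]:
    acc = clf.fit(X, y).score(X, y)
    print(f"{name}: training accuracy {acc:.2f}")
```

On data like this, where the generative assumptions are satisfied, the two approaches should reach similar accuracy; the trade-off the paper describes shows up at small sample sizes or under model misspecification.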

### 2001

- (Hand & Yu, 2001) ⇒ David J. Hand, and Keming Yu. (2001). “Idiot's Bayes - not so stupid after all?.” In: International Statistical Review, 69(3).
- QUOTE: Folklore has it that a very simple supervised classification rule, based on the typically false assumption that the predictor variables are independent, can be highly effective, and often more effective than sophisticated rules. We examine the evidence for this, both empirical, as observed in real data applications, and theoretical, summarising explanations for why this simple rule might be effective. … In this paper, following almost all of the work on the idiot’s Bayes method, we adopt a frequentist interpretation. … The phenomenon is not limited to medicine. Other studies which found that the independence Bayes method performed very well, often better than the alternatives, include Cestnik, Kononenko & Bratko (1987), Clark & Niblett (1989), Cestnik (1990), Langley, Iba & Thompson (1992), Pazzani, Muramatsu & Billsus (1996), Friedman, Geiger & Goldszmidt (1997), and Domingos & Pazzani (1997).

- (Rish, 2001) ⇒ Irina Rish. (2001). “An Empirical Study of the Naive Bayes Classifier.” In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

### 1998

- (McCallum & Nigam, 1998) ⇒ Andrew McCallum, and Kamal Nigam. (1998). “A Comparison of Event Models for Naive Bayes Text Classification.” In: AAAI/ICML-98 Workshop on Learning for Text Categorization.

### 1997

- (Domingos & Pazzani, 1997) ⇒ Pedro Domingos, and Michael J. Pazzani. (1997). “On the Optimality of the Simple Bayesian Classifier Under Zero-One Loss.” In: Machine Learning, 29.

### 1995

- (John & Langley, 1995) ⇒ George H. John, and Pat Langley. (1995). “Estimating Continuous Distributions in Bayesian Classifiers.” In: Proceedings of the eleventh Conference on Uncertainty in Artificial Intelligence (UAI 1995).