# Naïve Bayes (NB) Classification Algorithm


## References

### 2011b

• (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Naive_Bayes_classifier
• A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.

In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
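The independence assumption described above can be written compactly. As a sketch, in notation that is assumed here rather than taken from the source (a class variable $C$ and features $x_1, \dots, x_n$):

$$
p(C \mid x_1, \dots, x_n) \;\propto\; p(C) \prod_{i=1}^{n} p(x_i \mid C),
\qquad
\hat{c} = \arg\max_{c} \; p(C = c) \prod_{i=1}^{n} p(x_i \mid C = c).
$$

In the fruit example, the factors for "red", "round", and "about 4 inches in diameter" would each be estimated separately given the class "apple" and then multiplied, regardless of any dependence among them.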

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests.[2]

An advantage of the naive Bayes classifier is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

1. Zhang, H.: "The Optimality of Naive Bayes". FLAIRS 2004 conference.
2. Caruana, R. and Niculescu-Mizil, A.: "An empirical comparison of supervised learning algorithms". Proceedings of the 23rd International Conference on Machine Learning, 2006.
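The point about needing only per-class means and variances (and no covariance matrix) can be illustrated with a minimal Gaussian naive Bayes sketch. It assumes continuous features modeled as independent Gaussians within each class and uses maximum-likelihood estimates. All function names, the `var_floor` parameter, and the toy fruit data are made up for illustration and do not come from the quoted source.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior, per-feature means, and per-feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    n_total = len(X)
    for label, rows in by_class.items():
        n = len(rows)
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(n_features)]
        # Maximum-likelihood variance (divide by n), floored to avoid division by zero.
        variances = [
            max(sum((r[j] - means[j]) ** 2 for r in rows) / n, var_floor)
            for j in range(n_features)
        ]
        model[label] = (n / n_total, means, variances)
    return model


def log_posterior(model, x):
    """Unnormalized log posterior: log prior plus a sum of per-feature Gaussian log densities."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, x):
    """Return the class with the highest (unnormalized) log posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


# Toy data (hypothetical): [diameter in inches, redness on a 0..1 scale]
X = [[4.0, 0.9], [4.2, 0.8], [3.8, 0.85], [1.0, 0.2], [1.2, 0.3], [0.9, 0.25]]
y = ["apple", "apple", "apple", "grape", "grape", "grape"]
model = fit_gaussian_nb(X, y)
print(predict(model, [4.1, 0.9]))
```

With $d$ features and $k$ classes, this sketch stores $2dk$ numbers plus $k$ priors, whereas a full per-class covariance matrix would need on the order of $d^2 k$ entries.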

### 2004

• (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
• QUOTE:
• In supervised classification, inputs $x$ and their labels $y$ arise from an unknown joint probability $p(x, y)$. If we can approximate $p(x, y)$ using a parametric family of models $G = \{p_\theta(x, y), \theta \in \Theta\}$, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.
• However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density $p(y \vert x)$. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution $p(x)$. Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models, e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than their discriminative counterparts.
• Conversely, the generative approach converges to the best model for the joint distribution $p(x, y)$ but the resulting conditional density is usually a biased classifier unless its $p_\theta(x)$ part is an accurate model for $p(x)$. In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density $p(x, y)$ [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
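The generative/discriminative pairing in the quote can be made concrete with a small sketch. It assumes one-dimensional inputs and two classes, and compares a generative model with Gaussian class-conditionals and a shared variance (a 1-D analogue of LDA) against logistic regression fitted by plain gradient descent. The data, function names, learning rate, and step count are illustrative assumptions and are not taken from Bouchard & Triggs.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_generative(xs, ys):
    """Model the joint p(x, y): class prior, class means, and one shared variance."""
    n = len(xs)
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    mu0, mu1 = sum(x0) / len(x0), sum(x1) / len(x1)
    var = (sum((x - mu0) ** 2 for x in x0) + sum((x - mu1) ** 2 for x in x1)) / n
    return {"pi1": len(x1) / n, "mu0": mu0, "mu1": mu1, "var": var}


def generative_posterior(m, x):
    """p(y=1 | x) derived from the fitted joint model via Bayes' rule."""
    log_joint1 = math.log(m["pi1"]) - (x - m["mu1"]) ** 2 / (2 * m["var"])
    log_joint0 = math.log(1 - m["pi1"]) - (x - m["mu0"]) ** 2 / (2 * m["var"])
    return sigmoid(log_joint1 - log_joint0)


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Model p(y | x) directly by minimizing the mean negative log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical overlapping data, so the logistic fit has a finite optimum.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

gen = fit_generative(xs, ys)
w, b = fit_logistic(xs, ys)
for x in (0.0, 1.75, 3.5):
    print(x, round(generative_posterior(gen, x), 3), round(sigmoid(w * x + b), 3))
```

The generative fit commits to a model of $p(x)$ through its Gaussian assumption, while the logistic fit only ever touches $p(y \vert x)$. On data that violates the Gaussian assumption, the two posteriors would be expected to diverge as the sample grows.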

### 2001

• (Hand & Yu, 2001) ⇒ David J. Hand, and Keming Yu. (2001). “Idiot's Bayes - not so stupid after all?.” In: International Statistical Review, 69(3).
• QUOTE: Folklore has it that a very simple supervised classification rule, based on the typically false assumption that the predictor variables are independent, can be highly effective, and often more effective than sophisticated rules. We examine the evidence for this, both empirical, as observed in real data applications, and theoretical, summarising explanations for why this simple rule might be effective. … In this paper, following almost all of the work on the idiot’s Bayes method, we adopt a frequentist interpretation. … The phenomenon is not limited to medicine. Other studies which found that the independence Bayes method performed very well, often better than the alternatives, include Cestnik, Kononenko & Bratko (1987), Clark & Niblett (1989), Cestnik (1990), Langley, Iba & Thompson (1992), Pazzani, Muramatsu & Billsus (1996), Friedman, Geiger & Goldszmidt (1997), and Domingos & Pazzani (1997).
• (Rish, 2001) ⇒ Irina Rish. (2001). “An Empirical Study of the Naive Bayes Classifier.” In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.