Outlier Detection Algorithm

Jump to: navigation, search

An outlier detection algorithm is a detection algorithm that can be applied by an outlier detection system (to solve an outlier detection task).




  • (Aggarwal, 2013) ⇒ Charu C. Aggarwal. (2013). “Outlier Analysis." Springer Publishing Company, Incorporated. ISBN:1461463955, 9781461463955 doi:10.1007/978-1-4614-6396-2
    • QUOTE: The classical books relevant to outlier analysis are as follows:
      • P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 2003.
      • V. Barnett and T. Lewis. Outliers in Statistical Data, Wiley, 1994.
      • D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
    • We note that these books are quite outdated, and the most recent among them is a decade old. Furthermore, this (most recent) book is really focussed on the relationship between regression and outlier analysis, rather than the latter. Outlier analysis is a much broader area, in which regression analysis is only a small part.







  • (Breunig et al., 2000) ⇒ Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. (2000). “LOF: identifying density-based local outliers.” In: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2000). doi:10.1145/335191.335388
    • QUOTE: Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories.

      The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the probability distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [5]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results.

      The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. There are many definitions of depth that have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ³ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of W(nk/2) for n objects. …

      … Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.


  • (Knorr & Ng, 1998) ⇒ E. Knorr, and Raymond Ng. (1998). “Algorithms for Mining Distance-based Outliers in Large Data Sets.” In: Proceedings of the 24th International Conference on Very Large Databases (VLDB 1998).