2000 LOFIdentDensBasedLocOutliers


Subject Headings: Outlier Detection Task, Outlier Detection Algorithm.

Notes

Cited by

Quotes

Abstract

For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
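A minimal sketch of the LOF idea in code, using scikit-learn's LocalOutlierFactor, a later library implementation of the method described here (the dataset and the n_neighbors value are illustrative, not from the paper):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # a dense cluster of 100 points
               [[8.0, 8.0]]])                # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # binary view: -1 outlier, 1 inlier
scores = -lof.negative_outlier_factor_       # degree view: ~1 inlier, >> 1 outlier

print(labels[-1], scores[-1])                # the isolated point gets a high LOF

Unlike the binary labels, the scores quantify how outlying each object is, which is the degree the abstract refers to.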

2. Related Work

Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the probability distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [5]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results.

The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls, which has a lower bound complexity of Ω(n^⌈k/2⌉) for n objects.
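One way to make the depth idea concrete is convex-hull peeling in 2-d, where, as noted above, efficient hull algorithms exist. The sketch below is illustrative only and is not the specific depth definitions of [20] or [16]:

import numpy as np
from scipy.spatial import ConvexHull, QhullError

def peeling_depth(points):
    """Depth via hull peeling: points on the outermost convex hull get
    depth 1, the next peeled layer depth 2, and so on inward."""
    depth = np.zeros(len(points), dtype=int)
    remaining = np.arange(len(points))
    layer = 1
    while len(remaining) > 2:
        try:
            hull = ConvexHull(points[remaining])
        except QhullError:                # remaining points are degenerate
            break
        on_hull = remaining[hull.vertices]
        depth[on_hull] = layer
        remaining = np.setdiff1d(remaining, on_hull)
        layer += 1
    depth[remaining] = layer              # innermost leftover points
    return depth

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), [[6.0, 6.0]]])
print(peeling_depth(X)[-1])               # the isolated point lands on an outer
                                          # layer, i.e. it has a small depth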

Recently, Knorr and Ng proposed the notion of distance-based outliers [13], [14]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth-based approaches for larger values of k. Later, in Section 3, we will discuss in detail how their notion differs from the notion of local outliers proposed in this paper. In [17], the notion of distance-based outliers is extended by using the distance to the k-nearest neighbor to rank the outliers. A very efficient algorithm to compute the top n outliers in this ranking is given, but their notion of an outlier is still distance-based. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models.
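A minimal sketch of the ranking idea attributed to [17], scoring each object by its distance to its k-th nearest neighbor and reporting the top n (the values of k and n and the data are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_n_knn_outliers(X, k=5, n=3):
    # query k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    scores = dist[:, -1]                  # distance to the k-th nearest neighbor
    return np.argsort(scores)[::-1][:n]   # indices of the n highest-scoring objects

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[7.0, 7.0]]])
print(top_n_knn_outliers(X))              # the appended point ranks first

Note that the score is still a single global distance, which is why this notion remains distance-based rather than local.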

Finally, most clustering algorithms, especially those developed in the context of KDD (e.g. CLARANS [15], DBSCAN [7], BIRCH [23], STING [22], WaveCluster [19], DenClue [11], CLIQUE [3]), are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called “noise” in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification of how outlying an object is. Our notion of local outliers shares a few fundamental concepts with density-based clustering approaches. However, our outlier detection method does not require any explicit or implicit notion of clusters.
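The contrast drawn here can be seen directly in code: a density-based clustering algorithm such as DBSCAN [7] yields only a binary noise label, while LOF yields a degree. A sketch using scikit-learn implementations (eps, min_samples, and the data are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),  # one tight cluster
               [[2.0, 2.0]], [[5.0, 5.0]]])   # a near and a far stray point

is_noise = DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1   # True/False only
degree = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_

print(is_noise[-2:], degree[-2:])   # DBSCAN flags both strays alike;
                                    # LOF also says the far one is more outlying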

3. PROBLEMS OF EXISTING (NON-LOCAL) APPROACHES

Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

However, for many interesting real-world datasets which exhibit a more complex structure, there is another kind of outlier: objects that are outlying relative to their local neighborhoods, particularly with respect to the densities of those neighborhoods.
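A sketch of such a situation, in the spirit of the two-cluster example the paper develops in this section: next to a dense cluster, a point can be a clear outlier locally even though its distance to that cluster is smaller than the typical spread of a sparser cluster elsewhere (the data and parameters are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
sparse = rng.normal(0, 2.0, (50, 2))     # sparse cluster around the origin
dense = rng.normal(10, 0.2, (50, 2))     # dense cluster around (10, 10)
o2 = np.array([[10.0, 11.0]])            # stray point just outside the dense cluster
X = np.vstack([sparse, dense, o2])

scores = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_
print(scores[-1])   # o2 gets a high LOF; a single global distance threshold
                    # loose enough to accept the sparse cluster would miss it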

7. Experiments

In this section, with the proposed heuristic of taking the maximum LOF value within the range, we show that our ideas can be used to successfully identify outliers which appear to be meaningful but cannot be identified by other methods. We start with a synthetic 2-dimensional dataset, for which we show the outlier factors for all objects, in order to give an intuitive notion of the LOF values computed. The second example uses the real-world dataset that was used in [KN98] to evaluate the DB(pct, dmin) outliers.
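A minimal sketch of the heuristic referred to above, taking each object's maximum LOF over a range of neighborhood sizes (MinPts in the paper; n_neighbors in scikit-learn). The range bounds and the data are illustrative:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def max_lof(X, min_pts_lb=10, min_pts_ub=30):
    # evaluate LOF for every MinPts in [lb, ub] and keep the elementwise maximum
    scores = np.zeros(len(X))
    for k in range(min_pts_lb, min_pts_ub + 1):
        lof = LocalOutlierFactor(n_neighbors=k).fit(X)
        scores = np.maximum(scores, -lof.negative_outlier_factor_)
    return scores

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[6.0, 6.0]]])
print(max_lof(X)[-1])   # objects are then ranked by this maximum value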



References

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander (2000). “LOF: Identifying Density-Based Local Outliers.” In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000). doi:10.1145/335191.335388
http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/LOF.pdf