2004 ASurveyofOutlierDetectionMethod


Subject Headings: Outlier Detection, Outlier Detection Algorithm.

Quotes

Abstract

Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

1.0 Introduction

Outlier detection encompasses aspects of a broad spectrum of techniques. Many techniques employed for detecting outliers are fundamentally identical but with different names chosen by the authors. For example, authors describe their various approaches as outlier detection, novelty detection, anomaly detection, noise detection, deviation detection or exception mining. In this paper, we have chosen to call the technique outlier detection, although we also use novelty detection where we feel appropriate, and we incorporate approaches from all five categories named above. Additionally, authors have proposed many definitions for an outlier with seemingly no universally accepted definition. We will take the definition of Grubbs (Grubbs, 1969), quoted in Barnett & Lewis (Barnett and Lewis, 1994): "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs."

A further outlier definition from Barnett & Lewis (Barnett and Lewis, 1994) is: "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data." In figure 2, there are five outlier points labelled V, W, X, Y and Z which are clearly isolated and inconsistent with the main cluster of points. The data in the figures in this survey paper is adapted from the Wine data set (Blake and Merz, 1998).
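As a rough illustration of what "deviating markedly from other members of the sample" can mean in practice, the sketch below flags values in a one-dimensional sample that lie far from the median relative to the median absolute deviation (MAD). This is not a method taken from the survey: the function name `mad_outliers`, the 0.6745 scaling constant and the 3.5 cut-off are assumptions, chosen here as one common convention for a robust "modified z-score".

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration only; the threshold and the
    0.6745 constant are assumed conventions, not taken from the survey.
    """
    med = median(values)
    # Median absolute deviation: a spread estimate that is itself
    # barely affected by the outliers we are trying to find.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No measurable spread, so "marked deviation" is undefined here.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the bulk of the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.7]
print(mad_outliers(sample))  # prints [14.7]
```

A median-based score is used rather than mean and standard deviation because a single extreme value inflates the standard deviation and can mask itself; the choice of cut-off remains application dependent.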

John (John, 1995) states that an outlier may also be “surprising veridical data”, a point belonging to class A but actually situated inside class B so the true (veridical) classification of the point is surprising to the observer. Aggarwal (Aggarwal and Yu, 2001) notes that outliers may be considered as noise points lying outside a set of defined clusters or alternatively outliers may be defined as the points that lie outside of the set of clusters but are also separated from the noise. These outliers behave differently from the norm. In this paper, we focus on the two definitions quoted from (Barnett and Lewis, 1994) above and do not consider the dual class-membership problem or separating noise and outliers.

A more exhaustive list of applications that utilise outlier detection is:

Outliers arise because of human error, instrument error, natural deviations in populations, fraudulent behaviour, changes in behaviour of systems or faults in systems. How the outlier detection system deals with the outlier depends on the application area.

References

Victoria Hodge, Jim Austin (2004). A Survey of Outlier Detection Methodologies. doi:10.1023/B:AIRE.0000045502.10941.a9