2009 CoCoCodingCostforParameterFreeO


Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Outlier Detection, Coding Costs, Minimum Description Length, Data Compression

Abstract

How can we automatically spot all outstanding observations in a data set? This question arises in a large variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods strongly depend on suitable parameter settings, which are very difficult to estimate without background knowledge of the data, e.g. the minimum cluster size or the number of desired outliers. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their result is difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects which cannot be effectively compressed given the data set. To avoid the assumption of a certain data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the principle of the Minimum Description Length together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
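
The compression view of outliers in the abstract can be illustrated with a small sketch. This is not the CoCo algorithm: as a simplifying assumption it swaps the Exponential Power Distribution and Independent Components model for a single one-dimensional Gaussian fitted to all values, and it omits the MDL-based outlier factor. The function name `coding_costs` and the sample data are hypothetical. The cost of a value is taken as -log2 of its density under the fitted model, so values the model explains poorly need more bits to encode.

```python
import math
import statistics


def coding_costs(values):
    """Return -log2 p(x) for each value under a Gaussian fitted to all values.

    Assumption: a plain Gaussian stands in for CoCo's more general data model.
    Costs are based on densities, so they are only meaningful relative to
    one another, not as absolute bit counts.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; the fitted density is degenerate")
    costs = []
    for x in values:
        # Natural-log density of N(mu, sigma^2) evaluated at x.
        log_pdf = -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        # Convert from nats to bits and negate to obtain a cost.
        costs.append(-log_pdf / math.log(2))
    return costs


# Hypothetical sample: six similar values and one that stands out.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
costs = coding_costs(data)
most_costly = max(range(len(data)), key=costs.__getitem__)
print(data[most_costly])
```

In this sketch the value with the highest coding cost is the outlier candidate. Note that the outlier itself distorts the fitted mean and variance here; a more faithful treatment would have to address that, which this sketch does not attempt.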

References


Christian Böhm, Katrin Haegler, Nikola S. Müller, and Claudia Plant. (2009). "CoCo: Coding Cost for Parameter-free Outlier Detection." In: KDD-2009 Proceedings. doi:10.1145/1557019.1557042