2012 AnomalyDetectionforDiscreteSequ

(Chandola et al., 2012) ⇒ Varun Chandola, Arindam Banerjee, and Vipin Kumar. (2012). “Anomaly Detection for Discrete Sequences: A Survey.” In: IEEE Transactions on Knowledge and Data Engineering Journal, 24(5). doi:10.1109/TKDE.2010.235

Subject Headings: Outlier Detection, Survey Paper.

Notes

Cited By

Quotes

Abstract

This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete / symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous. We show how each of these problem formulations is characteristically distinct from each other and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection.

1. Introduction

Sequence data is found in a wide variety of application domains such as intrusion detection, bio-informatics, weather prediction, system health management, etc. Hence anomaly detection for sequence data is an important topic of research. There is extensive work on anomaly detection techniques [1–3] that look for individual objects that are different from normal objects. These techniques do not take the sequence structure of the data into consideration. For example, consider the set of user command sequences shown in Table 1. While sequences S1–S4 represent normal daily profiles of a user, the sequence S5 is possibly an attempt to break into a computer by trying different passwords. Though the sequence S5 is anomalous, each command in the sequence by itself is normal.

A sequence is an ordered series of events. Sequences can be of different types, such as binary, discrete, and continuous, depending on the type of events that form the sequences. Discrete and continuous sequences (or time series) are two most important form of sequences encountered in real life. In this survey we will focus on discrete sequences. Discrete or symbolic sequences are ordered sets of events such that the events are symbols belonging to a finite unordered alphabet. For example, a text document is a sequence of words, a computer program is executed as a sequence of system calls, a gene is a sequence of nucleic acids.

S1 login, passwd,mail, ssh, . . . ,mail,web, logout
S2 login, passwd,mail,web, . . . ,web,web,web, logout
S3 login, passwd,mail, ssh, . . . ,mail,web,web, logout
S4 login, passwd,web,mail, ssh, . . . ,web,mail, logout
S5 login, passwd, login, passwd, login, passwd, . . . , logout
TABLE 1
Sequences of User Commands

Often, one is interested in detecting anomalies in discrete sequences to find possible intrusions, frauds, faults, contaminations. For example, the sequences of commands issued by computer users (as shown in Table 1) are collected to detect possible intrusive activities. Anomaly detection for discrete sequences is a challenging task, since it involves exploiting the sequential nature of data to detect anomalies. Some of the specific challenges are as follows:

Multiple definitions of anomalies exist for sequences; an event within a sequence maybe anomalous, a subsequence within a sequence maybe anomalous, or the entire sequence maybe anomalous. Each definition needs to be handled differently. For example, given a data base of protein sequences, one might be interested in sequences that are anomalous. On the other hand, given a long sequence of system calls executed in a computer system, one might be interested in detecting subsequences when the computer system was under a virus attack. A technique that detects anomalous events within a sequence might not be directly applicable to detecting anomalies that are caused by a subsequence of events occurring together.

• The length of the anomaly within a sequence is usually not known and varies significantly across different application domains. Techniques typically have to rely on user-defined lengths, which may or may not be optimal.

• Computational complexity is a significant issue for sequence data, especially since sequences can be very long and the alphabet size can be large.

Anomaly detection for discrete sequences has been a focus of many research papers. But most of the anomaly detection techniques have been developed within specific application domains. In particular, several techniques have been developed to detect anomalies in operating system call data [4]– [10]. Budalakoti et al [11], [12] present a study of anomaly detection techniques to detect anomalies in flight safety domain. Sun et al [13] present a probabilistic suffix tree based technique and evaluate it on biological sequences. While each technique has been evaluated within a specific domain, there has not been an attempt to analyze the problem in its entirety. The reason such analysis is essential is that the nature of the sequence data as well as the nature of anomalies can differ fundamentally across domains, and hence while a technique is shown to be effective within one domain, it does not guarantee its effectiveness in a different domain.

Even though the existing techniques appear to have the same objective, i.e., to detect anomalies in discrete sequences, a deeper analysis reveals that different techniques actually address different problem formulations. For example, Forrest et al (4) proposed several techniques that detect an anomalous sequence from a database of unlabeled sequences. On the other hand, Chakrabarti et al (14) detect an anomalous subsequence within a long sequence of events. Gwadera et al [15] address the problem of determining if the frequency of occurrence of a short subsequence in a given long test sequence is anomalous with respect to its expected frequency of occurrence in normal sequences. All of these formulations are fundamentally distinct, and hence require exclusive techniques to solve them. It is not only important to identify which problem formulation is addressed by a given technique, but also to understand how techniques that address one formulation can be adapted to solve a different formulation.

The literature on anomaly detection for sequences is scattered across varied application domains, but there has not been any study that provides a global overview of the literature. A comparison of four different anomaly detection techniques for discrete sequences was presented by Forest et al [4], but all of the techniques focussed on system call intrusion detection. Chandola et al [16] provided a comparative evaluation of eight anomaly detection techniques on sequence data collected from different domains, but only focussed on the problem of detecting anomalous sequences from a database of sequences.

…

References

;

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2012 AnomalyDetectionforDiscreteSequ	Vipin Kumar Varun Chandola Arindam Banerjee			Anomaly Detection for Discrete Sequences: A Survey				10.1109/TKDE.2010.235		2012