2002 OntheNeedforTimeSeriesDataMinin

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Time-Series Dataset Data Mining Task.

Notes

Cited By

Quotes

Author Keywords

Abstract

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offer an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details. To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than two dozen papers, and testing them on 50 real world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.

1. INTRODUCTION

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of the work in the literature suffers from two types of experimental flaws, implementation bias and data bias (defined in detail below). Because of these flaws, much of the work has very little generalizability to real world problems.

In particular, we claim that many of the contributions made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offer an amount of “improvement” that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details.

In order to support our claim we have conducted the most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than 25 papers and testing them on 50 real word datasets. Our results strongly support our contention. We are anxious that this work should not be taken as been critical of the data mining community. We note that several papers by the current first author are among the worst offenders in terms of weak experimental evaluation. While preparing the survey we read more than 340 data mining papers and we were struck by the originality and diversity of approaches that researchers have used to attack very difficult problems. Our goal is simply to demonstrate that empirical evaluations in the past have often been inadequate, and we hope this work will encourage more extensive experimental evaluations in the future.

For concreteness we begin by defining the various tasks that occupy the attention of most time series data mining research.

Note that segmentation has two major uses. It may be performed in order to determine when the underlying model that created the time series has changed [19, 20], or segmentation may simply be performed to created a high level representation of the time series that supports indexing, clustering and classification [20, 30, 31, 37, 39, 42, 44, 46, 48, 52, 57].

As mentioned above, our experiments were conducted on 50 real world, highly diverse datasets. Space limitations prevent us from describing all 50 datasets in detail, so we simply note the following. The data represents the many areas in which time series data miners have investigated, including finance, medicine, biometrics, chemistry, astronomy, robotics, networking and industry. We also note that all data and code used in this paper is available for free by emailing the first author.

The rest of this paper is organized as follows. In Section 2 we survey the literature on time series data mining, and summarize some statistics about the empirical evaluations. In Section 3, we consider the indexing problem, and demonstrate with extensive experiments that many of the published results do not generalized to real world problems. Section 4 considers the problem of evaluating time series classification and clustering algorithms. In Section 5 we show that similar problems occur for evaluation of segmentation algorithms. Finally in Section 6 we summarize our findings and offer concrete suggestions to improve the quality of evaluation of time series data mining algorithms.

5. SEGMENTATION

A large fraction of the papers in the survey either introduce a segmentation algorithm as their main contribution, or utilize a segmentation algorithm as a subroutine. Although the segments created could be polynomials of an arbitrary degree, the most common representation of the segments are linear functions. Intuitively a Piecewise Linear Representation (PLR) refers to the approximation of a time series Q, of length n, with K straight lines. Figure 8 contains an example.

Figure 8. An example of a time series with its piecewise linear representation

Because K is typically much smaller that n, this representation makes the storage, transmission and computation of the data more efficient. Specifically, in the context of data mining, piecewise linear representation has been used to:

Surprisingly, in spite of the ubiquity of this representation, with the exception of [52], there has been little attempt to understand and compare the algorithms that produce it.

Although appearing under different names and with slightly different implementation details, most time series segmentation algorithms can be grouped into one of the following three categories.

We can measure the quality of a segmentation algorithm in several ways, the most obvious of which is to measure the reconstruction error for a fixed number of segments. The reconstruction error is simply the Euclidean distance between the original data and the segmented representation.

References

  • 1. Rakesh Agrawal, Christos Faloutsos, Arun N. Swami, Efficient Similarity Search In Sequence Databases, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, p.69-84, October 13-15, 1993
  • 2. Rakesh Agrawal, King-Ip Lin, Harpreet S. Sawhney, Kyuseok Shim, Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, Proceedings of the 21th International Conference on Very Large Data Bases, p.490-501, September 11-15, 1995
  • 3. Rakesh Agrawal, Giuseppe Psaila, Edward L. Wimmers, Mohamed Zaït, Querying Shapes of Histories, Proceedings of the 21th International Conference on Very Large Data Bases, p.502-514, September 11-15, 1995
  • 4. Henrik André-Jönsson, Dushan Z. Badal, Using Signature Files for Querying Time-Series Data, Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, p.211-220, June 24-27, 1997
  • 5. Bailey, D. (1991). Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review, Aug. 1991, Pp. 54--55.
  • 6. Bay, S. (1999). UCI Repository of Kdd Databases {http://kdd.ics.uci.edu/}. Irvine, CA: University of California, Department of Information and Computer Science
  • 7. Donald J. Berndt, James Clifford, Finding Patterns in Time Series: A Dynamic Programming Approach, Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, Menlo Park, CA, 1996
  • 8. Tolga Bozkaya, Nasser Yazdani, Meral Özsoyoğlu, Matching and Indexing Sequences of Different Lengths, Proceedings of the Sixth International Conference on Information and Knowledge Management, p.128-135, November 10-14, 1997, Las Vegas, Nevada, United States doi:10.1145/266714.266880
  • 9. Juan P. Caraça-Valente, Ignacio López-Chavarrías, Discovering Similar Patterns in Time Series, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.497-505, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347192
  • 10. Efficient Time Series Matching by Wavelets, Proceedings of the 15th International Conference on Data Engineering, p.126, March 23-26, 1999
  • 11. Kelvin Kam Wing Chu, Man Hon Wong, Fast Time-series Searching with Scaling and Shifting, Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, p.237-248, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States doi:10.1145/303976.304000
  • 12. Cohen, W. (1993). Efficient Pruning Methods for Separate-and-conquer Rule Learning Systems. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France. Pp 88--994.
  • 13. Gautam Das, Dimitrios Gunopulos, Heikki Mannila, Finding Similar Time Series, Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, p.88-100, June 24-27, 1997
  • 14. Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series. In: Proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. Pp 16--22.
  • 15. Debregeas, A. & Hebrail, G. (1998). Interactive Interpretation of Kohonen Maps Applied to Curves. In: Proceedings of the 4th Int'l Conference of Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. Pp 179--183.
  • 16. C. Faloutsos, H. Jagadish, A. Mendelzon, T. Milo, A Signature Technique for Similarity-Based Queries, Proceedings of the Compression and Complexity of Sequences 1997, p.2, June 11-13, 1997
  • 17. Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos, Fast Subsequence Matching in Time-series Databases, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, p.419-429, May 24-27, 1994, Minneapolis, Minnesota, United States doi:10.1145/191839.191925
  • 18. Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal, Amr El Abbadi, Approximate Nearest Neighbor Searching in Multimedia Databases, Proceedings of the 17th International Conference on Data Engineering, p.503-511, April 02-06, 2001
  • 19. Martin Gavrilov, Dragomir Anguelov, Piotr Indyk, Rajeev Motwani, Mining the Stock Market (extended Abstract): Which Measure is Best?, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.487-496, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347189
  • 20. Xianping Ge, Padhraic Smyth, Deformable Markov Model Templates for Time-series Pattern Matching, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.81-90, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347109
  • 21. Pierre Geurts, Pattern Extraction for Time Series Classification, Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, p.115-127, September 03-05, 2001
  • 22. Dina Q. Goldin, Paris C. Kanellakis, On Similarity Queries for Time-Series Data: Constraint Specification and Implementation, Proceedings of the First International Conference on Principles and Practice of Constraint Programming, p.137-153, September 19-22, 1995
  • 23. Valery Guralnik, Jaideep Srivastava, Event Detection from Time Series Data, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.33-42, August 15-18, 1999, San Diego, California, United States doi:10.1145/312129.312190
  • 24. Yun-Wu Huang, Philip S. Yu, Adaptive Query Processing for Time-series Data, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.282-286, August 15-18, 1999, San Diego, California, United States doi:10.1145/312129.318357
  • 25. Huhtala, Y., Kärkkäinen, J. & Toivonen, H. (1999). Mining for Similarities in Aligned Time Series Using Wavelets. Data Mining and Knowledge Discovery: Theory, Tools, and Technology, SPIE Proceedings Series, Vol. 3695. Orlando, FL, Apr. Pp 150--160.
  • 26. Piotr Indyk, Nick Koudas, S. Muthukrishnan, Identifying Representative Trends in Massive Time Series Data Sets Using Sketches, Proceedings of the 26th International Conference on Very Large Data Bases, p.363-372, September 10-14, 2000
  • 27. Tamer Kahveci, Ambuj K. Singh, Variable Length Queries for Time Series Data, Proceedings of the 17th International Conference on Data Engineering, p.273-282, April 02-06, 2001
  • 28. , A. Singh, T. Kahvec, A. Gü, An Efficient Index Structure for Shift and Scale Invariant Search of Multi-Attribute Time Sequences, Proceedings of the 18th International Conference on Data Engineering, p.266, February 26-March 01, 2002
  • 29. Konstantinos Kalpakis, Dhiral Gada, Vasundhara Puttagunta, Distance Measures for Effective Clustering of ARIMA Time-Series, Proceedings of the 2001 IEEE International Conference on Data Mining, p.273-280, November 29-December 02, 2001
  • 30. Keogh, E. & Pazzani, M. (1998). An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering Artd Relevance Feedback. In: Proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27--31. Pp 239--241.
  • 31. Keogh, E. & Smyth, P. (1997). A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. In: Proceedings of the 3rd Int'l Conference on Knowledge Discovery and Data Mining. Newport Beach, CA, Aug 14--17. Pp 24--20.
  • 32. Eamonn Keogh, Kaushik Chakrabarti, Michael Pazzani, Sharad Mehrotra, Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, p.151-162, May 21-24, 2001, Santa Barbara, California, United States doi:10.1145/375663.375680
  • 33. Kibler, D., & Langley, P. (1988). Machine Learning As An Experimental Science. In: Proceedings of the 3rd European Working Session on Learning. Pp. 81--92
  • 34. Edward Kim, Joyce M. W. Lam, Jiawei Han, AIM: Approximate Intelligent Matching for Time Series Data, Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, p.347-357, September 04-06, 2000
  • 35. Flip Korn, H. V. Jagadish, Christos Faloutsos, Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, p.289-300, May 11-15, 1997, Tucson, Arizona, United States doi:10.1145/253260.253332
  • 36. Sze Kin Lam, Man Hon Wong, A Fast Projection Algorithm for Sequence Data Searching, Data & Knowledge Engineering, v.28 n.3, p.321-339, Dec. 15, 1998 doi:10.1016/S0169-023X(98)00023-8
  • 37. Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. & Allan, J. (2000). Mining of Concurrent Text and Time Series. In: Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining Workshop on Text Mining. Boston, MA, Aug 20--23. Pp 37--44.
  • 38. Similarity Search for Multidimensional Data Sequences, Proceedings of the 16th International Conference on Data Engineering, p.599, February 28-March 03, 2000
  • 39. Chung-Sheng Li, Philip S. Yu, Vittorio Castelli, MALM: A Framework for Mining Sequence Database at Multiple Abstraction Levels, Proceedings of the Seventh International Conference on Information and Knowledge Management, p.267-272, November 02-07, 1998, Bethesda, Maryland, United States doi:10.1145/288627.288666
  • 40. Woong-Kee Loh, Sang-Wook Kim, Kyu-Young Whang, Index Interpolation: An Approach to Subsequence Matching Supporting Normalization Transform in Time-series Databases, Proceedings of the Ninth International Conference on Information and Knowledge Management, p.480-487, November 06-11, 2000, McLean, Virginia, United States doi:10.1145/354756.354856
  • 41. Chihcheng Hsu, Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases, Proceedings of the 16th International Conference on Data Engineering, p.23, February 28-March 03, 2000
  • 42. Sanghyun Park, Sang-Wook Kim, Wesley W. Chu, Segment-based Approach for Subsequence Searches in Sequence Databases, Proceedings of the 2001 ACM Symposium on Applied Computing, p.248-252, March 2001, Las Vegas, Nevada, United States doi:10.1145/372202.372334
  • 43. Sanghyun Park, Dongwon Lee, Wesley W. Chu, Fast Retrieval of Similar Subsequences in Long Sequence Databases, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange, p.60, November 07-07, 1999
  • 44. Polly Wan Po Man, Man Hon Wong, Efficient and Robust Feature Extraction and Pattern Matching of Time Series by a Lattice Structure, Proceedings of the Tenth International Conference on Information and Knowledge Management, October 05-10, 2001, Atlanta, Georgia, USA doi:10.1145/502585.502631
  • 45. Similarity Search Over Time-Series Data Using Wavelets, Proceedings of the 18th International Conference on Data Engineering, p.212, February 26-March 01, 2002
  • 46. Pratt, K. B. & Fink, E. (2002). Search for Patterns in Compressed Time Series. Int'l Journal of Image and Graphics. to Appear.
  • 47. Prechelt. L. (1995). A Quantitative Study of Neural Network Learning Algorithm Evaluation Practices. In: Proceedings of the 4th Int'l Conference on Artificial Neural Networks. Pp. 223--227.
  • 48. Yunyao Qu, Changzhou Wang, X. Sean Wang, Supporting Fast Search in Time Series for Movement Patterns in Multiple Scales, Proceedings of the Seventh International Conference on Information and Knowledge Management, p.251-258, November 02-07, 1998, Bethesda, Maryland, United States doi:10.1145/288627.288664
  • 49. Rafiei, D. & Mendelzon, A. O. (1998). Efficient Retrieval of Similar Time Sequences Using DFT. In: Proceedings of the 5th Int'l Conference on Foundations of Data Organization and Algorithms. Kobe, Japan, Nov 12--13.
  • 50. On Similarity-Based Queries for Time Series Data, Proceedings of the 15th International Conference on Data Engineering, p.410, March 23-26, 1999
  • 51. Cyrus Shahabi, Xiaoming Tian, Wugang Zhao, TSA-Tree: A Wavelet-Based Approach to Improve the Efficiency of Multi-Level Surprise and Trend Queries on Time-Series Data, Proceedings of the 12th International Conference on Scientific and Statistical Database Management (SSDBM'00), p.55, July 26-28, 2000 doi:10.1109/SSDM.2000.869778
  • 52. Hagit Shatkay, Stanley B. Zdonik, Approximate Queries and Representations for Large Data Sequences, Proceedings of the Twelfth International Conference on Data Engineering, p.536-545, February 26-March 01, 1996
  • 53. Simon, J. L. (1994). What Some Puzzling Problems Teach About the Theory of Simulation and the Use of Resampling. The American Statistician, Vol. 48(4). Nov. Pp 1--4.
  • 54. Zbigniew R. Struzik, Arno Siebes, The Haar Wavelet Transform in the Time Series Similarity Paradigm, Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, p.12-22, September 15-18, 1999
  • 55. Walker, J. (2001). HotBits: Genuine Random Numbers Generated by Radioactive Decay. Www.fourrnilab.ch/hotbits/
  • 56. Wang, C. & Wang, X. S. (2000). Multilevel Filtering for High Dimensional Nearest Neighbor Search. In: Proceedings of ACM SIGMOD Workshop on Research Lssues in Data Mining and Knowledge Discovery. Dallas, TX, May 14. Pp 37--43.
  • 57. Changzhou Wang, Xiaoyang Sean Wang, Supporting Content-Based Searches on Time Series via Approximation, Proceedings of the 12th International Conference on Scientific and Statistical Database Management (SSDBM'00), p.69, July 26-28, 2000 doi:10.1109/SSDM.2000.869779
  • 58. Changzhou Wang, X. Sean Wang, Supporting Subseries Nearest Neighbor Search via Approximation, Proceedings of the Ninth International Conference on Information and Knowledge Management, p.314-321, November 06-11, 2000, McLean, Virginia, United States doi:10.1145/354756.354834
  • 59. Leejay Wu, Christos Faloutsos, Katia P. Sycara, Terry R. Payne, FALCON: Feedback Adaptive Loop for Content-Based Retrieval, Proceedings of the 26th International Conference on Very Large Data Bases, p.297-306, September 10-14, 2000
  • 60. Yi-Leh Wu, Divyakant Agrawal, Amr El Abbadi, A Comparison of DFT and DWT based Similarity Search in Time-series Databases, Proceedings of the Ninth International Conference on Information and Knowledge Management, p.488-495, November 06-11, 2000, McLean, Virginia, United States doi:10.1145/354756.354857
  • 61. Byoung-Kee Yi, Christos Faloutsos, Fast Time Sequence Indexing for Arbitrary Lp Norms, Proceedings of the 26th International Conference on Very Large Data Bases, p.385-394, September 10-14, 2000
  • 62. Byoung-Kee Yi, H. V. Jagadish, Christos Faloutsos, Efficient Retrieval of Similar Time Sequences Under Time Warping, Proceedings of the Fourteenth International Conference on Data Engineering, p.201-208, February 23-27, 1998,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2002 OntheNeedforTimeSeriesDataMininEamonn Keogh
Shruti Kasetty
On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration10.1145/775047.775062