2012 MiningEmergingPatternsbyStreami

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Emerging Pattern, Emerging Pattern Mining Task, Pattern Recognition Task, Emerging Patterns by Streaming Feature Selection (EPSF) Mining System, Streaming Feature Selection.

Notes

Cited By

Quotes

Author Keywords

Abstract

Building an accurate emerging pattern classifier with a high-dimensional dataset is a challenging issue. The problem becomes even more difficult if the whole feature space is unavailable before learning starts. This paper presents a new technique on mining emerging patterns using streaming feature selection. We model high feature dimensions with streaming features, that is, features arrive and are processed one at a time. As features flow in one by one, we online evaluate each coming feature to determine whether it is useful for mining predictive emerging patterns (EPs) by exploiting the relationship between feature relevance and EP discriminability (the predictive ability of an EP). We employ this relationship to guide an online EP mining process. This new approach can mine EPs from a high-dimensional dataset, even when its entire feature set is unavailable before learning. The experiments on a broad range of datasets validate the effectiveness of the proposed approach against other well-established methods, in terms of predictive accuracy, pattern numbers and running time.

1. Introduction

An emerging pattern (EP for short) is a pattern whose support value changes significantly from one class to another [9]. Highly accurate classifiers can be built by aggregating the differentiating power of EPs [7, 10].

 Mining EPs is still a daunting problem when the number of feature dimensions can be in the thousands, as it is difficult to store, retrieve, prune, and sort them efficiently for classification with a huge number of candidate EPs. With the advent of emerging massive datasets involved with hundreds of thousands of features, such as in image processing, gene expression data, text data, and so on, this pattern search space is rather huge and even smoothing the full feature space sometimes becomes very costly or simply impossible. Therefore, mining EPs from such a space has to face two challenging research issues: (1) how to efficiently mine a small set of strongly predictive EPs from a high-dimensional dataset; and (2) how to mine strongly predictive EPs from a large feature space as exhaustive search over it is either very time-consuming or simply infeasible.

 In this paper, we propose a new approach to battle these two challenging issues. A novel contribution of our approach is that it uses streaming features to model a high-dimensional feature space, and then integrates streaming feature selection into the EP mining process to help efficient and effective discovery of a small set of strongly predictive EPs in a large feature space yet to get promising performances.

The concept of streaming features has been proposed to handle feature selection in a changing feature space over time [19, 24]. Unlike data streams, with streaming features, feature dimensions are modeled as a feature stream, and features flow in one by one and each feature is processed upon its arrival. Recent research has shown that streaming feature selection is effective and efficient with not only a huge feature space but also an unknown full feature space before learning \19]. However, if we consider streaming feature selection and EP mining as a whole, aggregating all features and samples to mine EPs is a hard research problem:

(1) Online data processing. Since feature dimensions flow in one by one, it is required to online transform, map and partition arriving features. Firstly, converting a real-world dataset into a desired encoded dataset of all items is infeasible before mining starts. Secondly, the mapping between the item numbers and real-world features needs to be constructed and updated as features flow in one by one over time. Thirdly, as a feature is available, we must online divide the data of each class accordingly instead of dividing all data with all features in advance.

(2) Dynamical EP mining. With streaming features, one solution to mine EPs is to employ streaming feature selection to dynamically control the EP mining process. The problem is how to integrate streaming feature selection into this EP mining process to get an accurate EP classifier.

 In this paper, we propose EPSF (mining Emerging Patterns by Streaming Feature selection). More specifically, EPSF assumes that features arrive one at a time, and each feature is online processed upon its arrival. With the online processing, a two-level framework is proposed to handle dynamical EP mining. In the first level, by exploiting the relations between feature relevance and EP discriminability (the predictive ability of an EP), EPSF online builds an influential feature pool by evaluating each arriving feature whether it is useful for mining strongly predictive EPs. In the second level, EPSF online builds a candidate EP pool by using this feature pool to online guide EP mining. As features flow in one by one, by interleaving the two levels, [EPSF]] provides a natural way to integrate streaming feature selection and EP mining for the high feature dimension challenge in EP mining.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 gives the preliminary and Section 4 presents our approach. Section 5 reports our experimental results. Finally, Section 6 provides our conclusion and future work.

2. Related Work

(...)

3. Preliminary Knowledge

(...)

4. Emerging Pattern Mining By Streaming Feature Selection

(...)

4.1 Feature Relevance and EP Discriminability

4.2 Mining Emerging Patterns with Streaming Feature Selection

5. Experimental Results

(...)

5.1 Experimental Setup

5.2 Comparison of Predictive Accuracy

5.3 Comparison of Numbers of Patterns

5.4 Comparison of Running Time

5.5 Analysis on the Predictive Accuracy Under Different Growth Rate Thresholds

5.6 Mining EPs without Smoothing Through the Whole Feature Space

5.7 A Summary of Experimental Results

6. Conclusion And Future Work

In this paper, we explored the relationships between feature relevance and EP discriminability. By employing the relationships, we integrated streaming feature selection to guide a dynamic EP mining process. This new approach can handle not only a large feature space, but also a high—dimensional dataset without knowing its entire feature set in advance. Experimental results have demonstrated the effectiveness and efficiency of our approach. We plan to apply our new approach to real planetary images that can generate infinite texture-based features.

References

  • 1. C. F. Aliferis, I. Tsamardinos, A. Statnikov & L.E. Brown. (2003) Causal Explorer: A Causal Probabilistic Network Learning Toolkit for Biomedical Discovery. METMBS'03.
  • 2. Roberto J. Bayardo, Jr., Efficiently Mining Long Patterns from Databases, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, p.85-93, June 01-04, 1998, Seattle, Washington, USA doi:10.1145/276304.276313
  • 3. James Bailey, Thomas Manoukian, Kotagiri Ramamohanarao, Fast Algorithms for Mining Emerging Patterns, Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, p.39-50, August 19-23, 2002
  • 4. James Bailey, Thomas Manoukian, Kotagiri Ramamohanarao, A Fast Algorithm for Computing Hypergraph Transversals and Its Application in Mining Emerging Patterns, Proceedings of the Third IEEE International Conference on Data Mining, p.485, November 19-22, 2003
  • 5. C. L. Blake & C. J. Merz. (1998) UCI Repository of Machine Learning Databases.
  • 6. Hongjian Fan, Kotagiri Ramamohanarao, An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification, Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, p.456-462, May 06-08, 2002
  • 7. Hongjian Fan, Kotagiri Ramamohanarao, Fast Discovery and the Generalization of Strong Jumping Emerging Patterns for Building Compact and Accurate Classifiers, IEEE Transactions on Knowledge and Data Engineering, v.18 n.6, p.721-737, June 2006 doi:10.1109/TKDE.2006.95
  • 8. Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, Vipin Kumar, Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data, IEEE Transactions on Knowledge and Data Engineering, v.24 n.2, p.279-294, February 2012 doi:10.1109/TKDE.2010.241
  • 9. Guozhu Dong, Jinyan Li, Efficient Mining of Emerging Patterns: Discovering Trends and Differences, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.43-52, August 15-18, 1999, San Diego, California, USA doi:10.1145/312129.312191
  • 10. Guozhu Dong, Xiuzhen Zhang, Limsoon Wong, Jinyan Li, CAEP: Classification by Aggregating Emerging Patterns, Proceedings of the Second International Conference on Discovery Science, p.30-42, December 01, 1999
  • 11. Ron Kohavi, George H. John, Wrappers for Feature Subset Selection, Artificial Intelligence, v.97 n.1-2, p.273-324, Dec. 1997 doi:10.1016/S0004-3702(97)00043-X
  • 12. Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao, Making Use of the Most Expressive Jumping Emerging Patterns for Classification, Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications, p.220-232, April 18-20, 2000
  • 13. Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao, Instance-Based Classification by Emerging Patterns, Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, p.191-200, September 13-16, 2000
  • 14. Wenmin Li, Jiawei Han, Jian Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, Proceedings of the 2001 IEEE International Conference on Data Mining, p.369-376, November 29-December 02, 2001
  • 15. B. Liu, W. Hsu, & Y. Ma. (1998) Integrating Classification and Association Rule Mining. KDD'98, 80--86.
  • 16. David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, Chengnian Sun, Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 28-July 01, 2009, Paris, France doi:10.1145/1557019.1557083
  • 17. Elsa Loekito, James Bailey, Fast Mining of High Dimensional Expressive Contrast Patterns Using Zero-suppressed Binary Decision Diagrams, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2006, Philadelphia, PA, USA doi:10.1145/1150402.1150438
  • 18. S. Mao & G. Dong. (2005) Discovery of Highly Differentiative Gene Groups from Microarray Gene Expression Data Using the Gene Club Approach. J. Bioinformatics and Computational Biology, 3(6):1263--1280.
  • 19. X. Wu, K.Yu, H. Wang & W. Ding. (2010) Online Streaming Feature Selection. ICML'10, 1159--1166.
  • 20. X. Yin & J. Han. (2003) CPAR: Classification based on Predictive Association Rule. SDM'03, 369--376.
  • 21. Kui Yu, Xindong Wu, Wei Ding, Hao Wang, Hongliang Yao, Causal Associative Classification, Proceedings of the 2011 IEEE 11th International Conference on Data Mining, p.914-923, December 11-14, 2011 doi:10.1109/ICDM.2011.30
  • 22. Lei Yu, Huan Liu, Efficient Feature Selection via Analysis of Relevance and Redundancy, The Journal of Machine Learning Research, 5, p.1205-1224, 12/1/2004
  • 23. Xiuzhen Zhang, Guozu Dong, Ramamohanarao Kotagiri, Exploring Constraints to Efficiently Mine Emerging Patterns from Large High-dimensional Datasets, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.310-314, August 20-23, 2000, Boston, Massachusetts, USA doi:10.1145/347090.347158
  • 24. Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar, Streamwise Feature Selection, The Journal of Machine Learning Research, 7, p.1861-1885, 12/1/2006

}};


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2012 MiningEmergingPatternsbyStreamiXindong Wu
Kui Yu
Wei Ding
Dan A. Simovici
Mining Emerging Patterns by Streaming Feature Selection10.1145/2339530.23395442012