- (Melli et al., 2003) ⇒ Gabor Melli, Siavash Amirrezvani, Felix Chen, and Neil Russell. (2003). “Column Reduction During Progressive Sampling.” In: Workshop on Data Mining for Actionable Knowledge (DMAK 2003).
Sampling and column (variable) selection are commonly used to gain insights that improve data mining performance on very large databases. This paper proposes an algorithm named CRPS (for Column Removal during Progressive Sampling) that integrates sampling and column reduction. This algorithm delivers a predictive model in less time than current approaches in tasks where not all columns are relevant. CRPS works in conjunction with predictive modeling algorithms that decide which columns to include in their models, such as decision trees and logistic regression. The CRPS algorithm iteratively models on progressively larger samples and removes some of the columns that were not used in the previous sample in order to reduce the overall amount of work. Other advantages of CRPS include a smaller memory requirement for processing the dataset, a ranking of relevance for each column, and the generation of simpler models based on fewer columns. The algorithm's time complexity depends on the number of columns in the final model and not on the number of columns in the dataset. In the worst case, when all columns are required by the model, CRPS does not increase the time complexity of progressive sampling, but could double the time of directly modeling all of the data. In more practical cases, where some of the columns are not relevant to the task, CRPS offers an advantage over traditional predictive modeling. Empirical results validate CRPS’s efficiency with no degradation in accuracy.
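The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stand-in "modeler" (`fit_and_score`, which scores each column by its absolute correlation with the target), the `threshold` that decides whether a column counts as used, the doubling sample schedule, the rule of dropping the weakest half of the unused columns, and running until the full dataset is reached are all assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def fit_and_score(rows, target, columns):
    """Hypothetical stand-in for a modeler that chooses its own columns."""
    return {c: abs(pearson([r[c] for r in rows], target)) for c in columns}

def crps_sketch(rows, target, start=50, growth=2, threshold=0.2,
                drop_fraction=0.5, seed=0):
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)       # fixed random order for nested samples
    active = list(range(len(rows[0])))       # columns still under consideration
    dropped_at = {}                          # column -> iteration it was removed
    n, step = start, 0
    while True:
        n = min(n, len(rows))
        idx = order[:n]
        scores = fit_and_score([rows[i] for i in idx],
                               [target[i] for i in idx], active)
        used = [c for c in active if scores[c] >= threshold]
        if n == len(rows):                   # assumed stopping rule: full data reached
            return used, dropped_at
        # Remove some (here: the weakest half) of the columns the model did not use.
        unused = sorted((c for c in active if scores[c] < threshold), key=scores.get)
        for c in unused[: int(len(unused) * drop_fraction)]:
            active.remove(c)
            dropped_at[c] = step
        n, step = n * growth, step + 1

def make_toy_data(n_rows=2000, n_cols=8, seed=1):
    """Synthetic data: only columns 0 and 1 influence the target."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    target = [1.0 if r[0] + r[1] > 0 else 0.0 for r in rows]
    return rows, target

rows, target = make_toy_data()
used, dropped_at = crps_sketch(rows, target)
print("columns in final model:", used)
print("iteration at which each column was dropped:", dropped_at)
```

In this sketch, `dropped_at` gives a rough relevance ranking in the sense that columns removed earlier looked less useful on smaller samples; how CRPS itself ranks columns is not specified in the abstract.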
- A. Agresti (1990), Categorical Data Analysis. New York: John Wiley.
- Aha, D. (1998), Feature weighting for lazy learning algorithms. In: H. Liu and H. Motoda (Eds.) Feature Extraction, Construction and Selection: A Data Mining Perspective. Norwell MA: Kluwer.
- Bellman, R. E. (1961), Adaptive Control Processes. Princeton University Press, Princeton, NJ.
- Cardie, C. (1993), Using decision trees to improve case-based learning, in: Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann Publishers, Inc., pp. 25-32.
- Pedro Domingos & Pazzani, M. (1996), Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Machine Learning: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann.
- Friedman, J. H. (1997), Data mining and statistics: What's the connection? In: Proceedings of the 29th Symposium on the Interface Between Computer Science and Statistics.
- John, G., & Langley, P. (1996), Static versus dynamic sampling for data mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 367-370. Portland, OR: AAAI Press.
- Ron Kohavi and John, G. (1997), Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324.
- Kononenko, I. (1994), Estimating attributes: Analysis and extensions of Relief. In Bergadano, F. and Raedt, L. D., editors, Proceedings of the European Conference on Machine Learning.
- Langley, P. (1994), Selection of relevant features in machine learning, in AAAI Fall Symposium on Relevance, pp. 140--144.
- Murphy, P. M., and Aha, D.W. (1999), UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1999.
- Oates, T., and Jensen, D. (1998), Large datasets lead to overly complex models: an explanation and a solution. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pp. 294-298, Rakesh Agrawal and Paul Stolorz, Eds., Menlo Park, CA: AAAI Press.
- Provost, F., and Kolluri, V. (1999), A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 2
- Provost, F., Jensen, D., and Oates, T. (1999), Efficient Progressive Sampling. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), pp. 23-32, San Diego, CA. ACM Press.
- J. Ross Quinlan (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.