2015 DataMiningTheTextbook

Jump to: navigation, search

Subject Headings: Data Mining Textbook; AMSSketch; Active Learning; Co-Clustering; k-Means Clustering; Aggregate Change Points as Outliers; Algorithm Variations and Refinements; Similarity-based Application; Approximate Frequent Patterns; Approximation in Terms of Itemsets; Approximation in Terms of Transactions; Assessing Model Effectiveness; Association Pattern Mining; Association Rule Generation Framework; Associative Classifiers; Autoregressive Model; Autoregressive Moving Average Model; BIRCH; BOAT; Bag-of-Words Kernel; Bagging; Bayes Classifiers; Betweenness Centrality; Bibliographic Notes; Binary and Set Data; Binary and Set-Valued Data; Bioinformatics; Biological and Medical Applications; Bisecting k-Means; Bloom Filter; Boosting; Bootstrap; Bottom-up Agglomerative Method; CLARANS; CLIQUE; CURE; Categorical Data; Categorical and Mixed Attribute Data; Categorical to Numeric Data: Binarization; Categorization by Component Independence; Categorization by Constituent Components; Centroid-based Classification; Data-Driven Classification; Classification of Shapes; Classification with Complex Data Types; Classifier Evaluation; Closed Patterns; Closeness Centrality and Proximity Prestige; Clu Stream Algorithm; Cluster Ensembles; Cluster Validation; Clustering Categorical Data; Clustering Data Streams; Clustering Method; Clustering Shapes; Clustering and Distance-based Method; Clustering for Outlier Detection; Clustering with Complex Data Types; Co-Training; Co-clustering; Co-location Patterns; Collective Classification; Collective Strength; Combatting Spider Traps; Combination Outliers; Combining Different Ensemble Components; Common Mistakes; Community Detection; Comparing Various Linear Model; Comparison with Singular Value Decomposition; Comparison with other Linear Model; Computational Considerations; Computing Similarity between Trajectories; Connections with Random-Walk Method; Constrained Sequential Pattern Mining; Content-based Recommendations; Converting Trajectories to Multidimensional Data; Cosine Coefficient on Columns; Count-Min Sketch; Cross-Validation; Customer Recommendations; Customer Segmentation and Collaborative Filtering; DBSCAN; DENCLUE; Data Classification; Data Cleaning; Data Clustering; Data Partitioned Ensemble; Data Preprocessing; Data Reduction and Transformation; Data Summarization; Data Transformation and Reduction; Data Type Portability; Data-centered Ensemble; Decision Trees; Degree Centrality and Prestige; Demographic and Profile Analysis; Density-based Method; Dependency Oriented Data; Depth-based Method; Design Variations of Nearest Neighbor Classifiers; Dimensionality Reduction; Dimensionality Reduction with Axis Rotation; Dimensionality Reduction with Type Transformation; Discrete Fourier Transform; Discrete Sequence Similarity Measure; Discrete Sequence to Numeric Data; Discrete Sequences and Strings; Discrete Wavelet Transform; Semisupervised Learning; Distance-based Method; Distance-based Model; Distance-based Motifs; Distance-based Outlier Detection; Distributed Privacy; Document Normalization and Similarity Computation; Document Preparation and Similarity Computation; Dynamic Time Warping Distance; Dynamics of Network Formation; Early Termination Trick with Nested Loops; Earth Science Applications; Edge-based Join Growth; Edit Distance; Efficiency Issues: Probabilistic Suffix Trees; Efficient Support Counting; Embedded Model; Ensemble Method; Ensemble Method; Entropy; Enumeration-Tree Algorithms; Enumeration-Tree-based Interpretation of Apriori; Equivalence of Trajectories and Multivariate Time Series; Evaluation: Computing the Fit Probability for Observed Sequence; Example Re-weighting; Exercises; Expected Error Reduction; Expected Model Change; Expected Variance Reduction; Explanation: Determining the Most Likely State Sequence for Observed Sequence; External Validation Criteria; Extreme Value Analysis; Feature Extraction; Feature Extraction and Portability; Feature Selection for Classification; Feature Selection for Clustering; Feature Subset Selection; Filter Model; Financial Fraud and Anomalous Events; Fisher Score; Fisher’s Linear Discriminant; Flajolet-Martin Algorithm for Distinct Element Counting; Formal Definition and Techniques for HMMs; Formal Statement of Bias-Variance Trade-off; Frequency-based Model; Frequent Itemset Mining Algorithm; Frequent Pattern Mining in Data Streams; Frequent Pattern Mining to Graph Pattern Mining; Frequent Patterns to Frequent Sequences; Frequent Substructure Mining in Graphs; Frequent Substructure-based Method; Frequent Substructure-based Transformation and Distance Computation; Frequent Trajectory Paths; From Decision Trees to Regression Trees; General Comments; Generalization to Other Data Types; Generalized Linear Model; Generic Meta-Algorithm; Generic Transformational Approach; Gini Index; Girvan-Newman Algorithm; Graph Classification; Graph Clustering; Graph Edit Distance; Graph Matching Methods for Distance Computation; Graph Regularization Approach; Graph Similarity Measure; Graph-based Algorithm; Graph-based Method; Graph-based Semisupervised Learning; Graphs to Numeric Data; Grid Search for Subspace Outliers; Grid-based Method; Grid-based Rare Subspace Exploration; Group-based Statistics; HITS; Haar Wavelet Transform; Handling Concept Drift; Handling Incorrect and Inconsistent Entries; Handling Missing Entries; Handling Missing Values; Heterogeneity-based Model; Hidden Markov Model; Hierarchical Clustering Algorithm; Hierarchical Method; High-Dimensional Clustering; High-Dimensional Outlier Detection; Histogram- and Grid-based Techniques; Holdout; Homophily; Hopkins Statistic; Human and Visually Supervised Clustering; Hypergraph Partitioning Algorithm; Impact of Behavioral Attribute Normalization; Impact of Complex Data Types on Problem Definitions; Impact of Data Distribution; Impact of Different Lp-norm; Impact of Domain-specific Relevance; Impact of High Dimensionality; Impact of Local Data Distribution; Impact of Locally Irrelevant Feature; Implementation with Arrays but no Pointers; Implementation with Pointers and F P-Tree; Implementation with Pointers but no F P-Tree; Important Observations and Intuitions; Incognito; Independent Cascade Model; Independent Ensemble; Individual Data Points as Outliers; Influence Function Evaluation; Information-Theoretic Model; Instance-based Classifiers; Instance-based Learning; Instance-specific Mahalanobis Distance; Interest Ratio; Internal Validation Criteria; Intrusion Detection Applications; Item-based Similarity with Ratings; Iterative Classification Algorithm; Iterative Label Propagation: The Spectral Interpretation; Jaccard Coefficient and the Min-hash Trick; Katz Measure; Kernel Density Estimation; Kernel Support Vector Machine; Kernel-based Transformations and Computation; Kernighan-Lin Algorithm; Label Propagation with Random Walks; Latent Factor Model; Latent Semantic Analysis; Learn-One-Rule; Leveraging Aggregate Distributions for Data Mining; Leveraging Data Structures for Querying; Leveraging Latent Semantic Analysis; Leveraging Synopsis Structure; Leveraging the Itemset Lattice; Limitations of PLSA; Linear Regression; Linear Threshold Model; Link Prediction; Link Prediction as a Classification Problem; Link Prediction as a Missing Value Estimation Problem; Local Distance Correction Method; Local Outlier Factor (LOF); Logistic Regression; Longest Common Subsequence; Lossy Counting Algorithm; Lp-norm; Market Basket Analysis; Markovian Similarity-based Algorithm: CLUSEQ; Massive-Domain Stream Clustering; Massive-Domain Streaming Classification; Match-based Similarity Computation; Matching and Distance Computation in Graphs; Matrix Factorization; Maximal Patterns; Maximum Common Subgraph Problem; Maximum Common Subgraph-based Distances; Measures of Centrality; Measures of Prestige; Medical Diagnosis; Meta-clustering Algorithm; Micro-clustering Algorithm; Micro-clustering Method; Mining with Contextual Spatial Attributes; Mixed Quantitative and Categorical Data; Mixture of Hidden Markov Model; Model-centered Ensemble; Modeling Abnormal Lower Dimensional Projections; Mondrian Multidimensional k-Anonymity; Multiclass Learning; Multidimensional Data; Multidimensional Scaling; Multilayer Neural Networks; Multilevel Graph Partitioning: METIS; Multimedia Applications; Multinomial Bayes Model; Multiple Threads; Multivariate Extreme Values; Multivariate Forecasting with Hidden Variables; Naive Bayes Classifier; Nearest Neighbor Classifier; Nearest Neighbors with Linear Discriminant Analysis; Neighborhood-based Measure; Neighborhood-based Methods for Collaborative Filtering; Network and Graph Data; Neural Networks; Node-based Join Growth; Noise Removal; Non-dependency Oriented Data; Nonlinear Distributions: ISOMAP; Nonlinear Support Vector Machine; Nonlinear and Polynomial Regression; Nonnegative Matrix Factorization; Normalization; Normalization and Combination; Novelty and First-Story Detection; Numeric to Categorical Data: Discretization; ORCLUS; Online Clustering of Co-evolving Series; Other Applications for Complex Data Types; Other Applications of Kernel Method; Outlier Analysis; Outlier Detection; Outlier Detection in Sequences; Outlier Detection with Categorical Data; Outlier Detection with Complex Data Types; Outlier Ensembles; Outlier Validity; Output Privacy; Output as Class Labels; Output as Numerical Score; PROCLUS; Page Rank; Pairwise Supervision; Parameter Tuning with Internal Measure; Pattern Mining with Complex Data Types; Pattern Querying; Pattern Summarization; Periodic Patterns; Point Outliers; Pointwise Supervision; Position Outliers; Power-Law Degree Distributions; Practical Issues; Predictive Attribute Dependence; Preferential Crawlers; Preprocess-once Query-many Paradigm; Principal Component Analysis; Principal Component Regression; Privacy during Data Collection; Privacy-Preserving Data Publishing; Probabilistic Classifiers; Probabilistic Clustering; Hidden Markov Model; Probabilistic Model-based Algorithm; Probabilistic Model; Pruning Method; Pushing Constraints into Pattern Mining; Putting Associations to Work: Applications; Putting Clustering to Work: Applications; Putting Outliers to Work: Applications; Pyramidal Time Frame; Quality Control and Fault Detection; Quantification Issues; Quantitative Data; Quantitative Multidimensional Data; Query-by-Committee; ROCK; Rain Forest; Random Forests; Random Subspace Sampling; Random Walk-based Measure; Random Walk-based Similarity; Random-Walk Kernels; Rank Centrality and Prestige; Ranking Algorithm; Rare Class Learning; Receiver Operating Characteristic; Recommendations and Collaborative Filtering; Recommender System; Reconstructing Aggregate Distributions; Recursive Suffix-based Pattern Growth Method; Regression Modeling with Numeric Classes; Representative-based Algorithm; Representativeness-based Model; Reservoir Sampling; Reservoir Sampling for Data Streams; Rocchio Classification; Rule Generation from Decision Trees; Rule Pruning; Rule-based Classifiers; Rule-based Method; STREAMAlgorithm; SVMClassifiers for High-dimensional and Sparse Data; Samarati’s Algorithm; Sampling; Sampling Method; Sampling for Static Data; Scalability Issues and the Streaming Scenario; Scalable Classification; Scalable Data Clustering; Scalable Decision Trees; Scalable Support Vector Machine; Scaling and Normalization; Scatter/ Gather Approach; Search Engine Indexing and Query Processing; Selecting Different Ensemble Components; Self-Training; Semisupervised Bayes Classification with E M; Semisupervised Clustering; Semisupervised Learning; Sequence Classification; Sequence Clustering; Sequence-based Method; Sequential Covering Algorithm; Sequential Ensembles; Sequential Pattern Mining; Shape Outliers; Shape to Time-Series Transformation; Shape-based Clustering; Shingling for Near Duplicate Detection; Shortest-Path Kernels; Sim Rank; Similarity Search and Indexing; Similarity between Two Graphs; Similarity between Two Nodes in a Single Graph; Similarity-based Clustering Method; Simultaneous Document and Word Cluster Discovery; Single-Layer Neural Network: The Perceptron; Singular Value Decomposition; Sketches; Social Influence Analysis; Social Network Analysis; Social Networks: Preliminaries and Properties; Solving the Lagrangian Dual; Spatial Co-location Patterns; Spatial Data; Spatial to Multidimensional Transformation with Wavelets; Spatial to Numeric Data; Specialized Classification Methods for Text; Specialized Clustering Methods for Text; Specialized Preprocessing for Web Documents; Spectral Clustering; Spectral Transformation and Embedding of Graphs; Spectrum Kernel; Speeding up Kernighan-Lin; Split Criteria; Stacking; Statistical Coefficient of Correlation; Stopping Criterion and Pruning; Store Product Placement; Streaming Classification; Streaming Outlier Detection; Structural Distance-based Measure; Subsequence-based Clustering; Supervised Event Detection; Supervised Feature Generation with Spectral Embedding; Supervised Micro-cluster Approach; Supervised Similarity Functions; Supervised Spectral Method; Support Vector Machine; Support Vector Machines for Linearly Separable Data; Support Vector Machines with Soft Margin for Nonseparable Data; Symbolic Aggregate Approximation (SAX); Symmetric Confidence Measure; Synopsis Data Structures for Streams; Synopsis Structures for the Massive-Domain Scenario; Synthetic Data Generation: Condensation-based Approach 651; Synthetic Over-sampling: SMOTE; Temporal Similarity Measure; Term Strength; Text Applications; Text Data; Text Similarity Measure; ?-diversity Model; Analytical Phase; Apriori Algorithm; Basic Data Types; The Curse of Dimensionality; Data Mining Process; Data Preprocessing Phase; Frequent Pattern Mining Model; Kernel Trick; Kernel k-Means Algorithm; Ranking Model for Classification; k-Means Algorithm; k-Medians Algorithm; k-Medoids Algorithm; k-anonymity Model; t-closeness Model; Time Series to Discrete Sequence Data; Time Series to Numeric Data; Time-Series Classification; Time-Series Clustering; Time-Series Data; Time-Series Forecasting; Time-Series Motifs; Time-Series Outlier Detection; Time-Series Preparation and Similarity; Time-Series Similarity Measure; Top-down Divisive Method; Topic Modeling; Topic-Sensitive Page Rank; Topological Descriptors; Trade-offs with Different Data Structure; Training a Logistic Regression Classifier; Training: Baum-Welch Algorithm; Trajectory Classification; Trajectory Clustering; Trajectory Clustering as a Sequence Clustering Problem; Trajectory Mining; Trajectory Outlier Detection; Trajectory Pattern Mining; Transductive Support Vector Machine; Transformation to Sequential Pattern Mining; Transformation-based Distance Computation; Tree Projection and Depth Project; Triadic Closure and Clustering Coefficient; Ullman’s Algorithm for Subgraph Isomorphism; Uncertainty Sampling; Univariate Extreme Value Analysis; Unsupervised Mahalanobis Metric; User-based Similarity with Ratings; VFDTFamily; Vertical Counting Method; Visual Clustering; Wavelet-based Rules; Web Crawling and Resource Discovery; Web Log Analytics; Web Log Anomalies; Web Usage Mining; Weighted Degree Kernel; Whole-Series Classification; Why does Ensemble Analysis Work?; Window-based Method; Wrapper Model; X Proj; Direct Clustering with Frequent Subgraph Discovery; X Rules; k-Means; k-Medoids; k-Medoids Clustering; k-Modes Clustering;


Cited By



This textbook explores the different aspects of data mining from the fundamentals to the complex data types and their applications, capturing the wide diversity of problem domains for data mining issues. It goes beyond the traditional focus on data mining problems to introduce advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. Until now, no single book has addressed all these topics in a comprehensive and integrated way. The chapters of this book fall into one of three categories: Fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems. Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor. Appropriate for both introductory and advanced data mining courses, Data Mining: The Textbook balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners (including those with a limited mathematical background). Numerous illustrations, examples, and exercises are included, with an emphasis on semantically interpretable examples.

Table of Contents

1 An Introduction to Data Mining

1.1 Introduction p.1
1.2 The Data Mining Process p.3
1.2.1 The Data Preprocessing Phase p.5
1.2.2 The Analytical Phase p.6
1.3 The Basic Data Types p.6
1.3.1 Non-dependency Oriented Data p.6 Quantitative Multidimensional Data p.7 Categorical and Mixed Attribute Data p.7 Binary and Set Data p.8 Text Data p.8
1.3.2 Dependency Oriented Data p.9 Time-Series Data p.9 Discrete Sequences and Strings p.10 Spatial Data p.11 Network and Graph Data p.12
1.4 The Major Building Blocks: A Bird’s Eye View p.13
1.4.1 Association Pattern Mining p.14
1.4.2 Data Clustering p.15
1.4.3 Outlier Detection p.16
1.4.4 Data Classification p.17
1.4.5 Impact of Complex Data Types on Problem Definitions p.18 Pattern Mining with Complex Data Types p.18 Clustering with Complex Data Types p.19 Outlier Detection with Complex Data Types p.19 Classification with Complex Data Types p.20
1.5 Scalability Issues and the Streaming Scenario p.20
1.6 A Stroll through some Application Scenarios p.21
1.6.1 Store Product Placement p.21
1.6.2 Customer Recommendations p.21
1.6.3 Medical Diagnosis p.22
1.6.4 Web Log Anomalies p.22
1.7 Summary p.23
1.8 Bibliographic Notes p.23
1.9 Exercises p.24

2 Data Preparation

2.1 Introduction p.25
2.2 Feature Extraction and Portability p.26
2.2.1 Feature Extraction p.26
2.2.2 Data Type Portability p.27 Numeric to Categorical Data: Discretization p.28 Categorical to Numeric Data: Binarization p.29 Text to Numeric Data p.29 Time Series to Discrete Sequence Data p.30 Time Series to Numeric Data p.30 Discrete Sequence to Numeric Data p.30 Spatial to Numeric Data p.31 Graphs to Numeric Data p.31 Any Type to Graphs for Similarity-based Applications p.31
2.3 Data Cleaning p.32
2.3.1 Handling Missing Entries p.33
2.3.2 Handling Incorrect and Inconsistent Entries p.33
2.3.3 Scaling and Normalization p.34
2.4 Data Reduction and Transformation p.35
2.4.1 Sampling p.35 Sampling for Static Data p.36 Reservoir Sampling for Data Streams p.36
2.4.2 Feature Subset Selection p.38
2.4.3 Dimensionality Reduction with Axis Rotation p.38 Principal Component Analysis p.39 Singular Value Decomposition p.41 Latent Semantic Analysis p.45 Applications of PCA and SVD p.45
2.4.4 Dimensionality Reduction with Type Transformation p.46 Haar Wavelet Transform p.47 Multidimensional Scaling p.52 Spectral Transformation and Embedding of Graphs p.54
2.5 Summary p.56
2.6 Bibliographic Notes p.57
2.7 Exercises p.57

3 Similarity and Distances

3.1 Introduction p.59
3.2 Multidimensional Data p.60
3.2.1 Quantitative Data p.60 Impact of Domain-specific Relevance p.61 Impact of High Dimensionality p.61 Impact of Locally Irrelevant Features p.62 Impact of Different Lp-norms p.63 Match-based Similarity Computation p.64 Impact of Data Distribution p.65 Nonlinear Distributions: ISOMAP p.66 Impact of Local Data Distribution p.67 Computational Considerations p.69
3.2.2 Categorical Data p.69
3.2.3 Mixed Quantitative and Categorical Data p.70
3.3 Text Similarity Measures p.71
3.3.1 Binary and Set Data p.72
3.4 Temporal Similarity Measures p.72
3.4.1 Time-Series Similarity Measures p.73 Impact of Behavioral Attribute Normalization p.74 Lp-norm p.74 Dynamic Time Warping Distance p.74 Window-based Methods p.77
3.4.2 Discrete Sequence Similarity Measures p.77 Edit Distance p.77 Longest Common Subsequence p.79
3.5 Graph Similarity Measures p.80
3.5.1 Similarity between Two Nodes in a Single Graph p.80 Structural Distance-based Measure p.80 Random Walk-based Similarity p.81
3.5.2 Similarity between Two Graphs p.81
3.6 Supervised Similarity Functions p.82
3.7 Summary p.83
3.8 Bibliographic Notes p.84
3.9 Exercises p.85

4 Association Pattern Mining

4.1 Introduction p.87
4.2 The Frequent Pattern Mining Model p.88
4.3 Association Rule Generation Framework p.91
4.4 Frequent Itemset Mining Algorithms p.92
4.4.1 Brute Force Algorithms p.93
4.4.2 The Apriori Algorithm p.94 Efficient Support Counting p.95
4.4.3 Enumeration-Tree Algorithms p.96 Enumeration-Tree-based Interpretation of Apriori p.99 Tree Projection and Depth Project p.99 Vertical Counting Methods p.104
4.4.4 Recursive Suffix-based Pattern Growth Methods p.106 Implementation with Arrays but no Pointers p.107 Implementation with Pointers but no F P-Tree p.108 Implementation with Pointers and F P-Tree p.109 Trade-offs with Different Data Structures p.112 Relationship between F P-growth and Enumeration-Tree Methods p.113
4.5 Alternative Models: Interesting Patterns p.115
4.5.1 Statistical Coefficient of Correlation p.116
4.5.2 ?2 Measure p.116
4.5.3 Interest Ratio p.117
4.5.4 Symmetric Confidence Measures p.117
4.5.5 Cosine Coefficient on Columns p.118
4.5.6 Jaccard Coefficient and the Min-hash Trick p.118
4.5.7 Collective Strength p.119
4.5.8 Relationship to Negative Pattern Mining p.120
4.6 Useful Meta-Algorithms p.120
4.6.1 Sampling Methods p.120
4.6.2 Data Partitioned Ensembles p.121
4.6.3 Generalization to Other Data Types p.121 Quantitative Data p.122 Categorical Data p.122
4.7 Summary p.122
4.8 Bibliographic Notes p.123
4.9 Exercises p.124

5 Association Pattern Mining: Advanced Concepts

5.1 Introduction p.127
5.2 Pattern Summarization p.128
5.2.1 Maximal Patterns p.128
5.2.2 Closed Patterns p.129
5.2.3 Approximate Frequent Patterns p.131 Approximation in Terms of Transactions p.131 Approximation in Terms of Itemsets p.132
5.3 Pattern Querying p.133
5.3.1 Preprocess-once Query-many Paradigm p.133 Leveraging the Itemset Lattice p.133 Leveraging Data Structures for Querying p.134
5.3.2 Pushing Constraints into Pattern Mining p.138
5.4 Putting Associations to Work: Applications p.139
5.4.1 Relationship to Other Data Mining Problems p.139 Application to Classification p.139 Application to Clustering p.139 Applications to Outlier Detection p.139
5.4.2 Market Basket Analysis p.140
5.4.3 Demographic and Profile Analysis p.140
5.4.4 Recommendations and Collaborative Filtering p.140
5.4.5 Web Log Analysis p.140
5.4.6 Bioinformatics p.141
5.4.7 Other Applications for Complex Data Types p.141
5.5 Summary p.141
5.6 Bibliographic Notes p.142
5.7 Exercises p.143

6 Cluster Analysis

6.1 Introduction p.145
6.2 Feature Selection for Clustering p.146
6.2.1 Filter Models p.147 Term Strength p.147 Predictive Attribute Dependence p.147 Entropy p.148 Hopkins Statistic p.149
6.2.2 Wrapper Models p.150
6.3 Representative-based Algorithms p.151
6.3.1 The k-Means Algorithm p.154
6.3.2 The Kernel k-Means Algorithm p.155
6.3.3 The k-Medians Algorithm p.155
6.3.4 The k-Medoids Algorithm p.155
6.4 Hierarchical Clustering Algorithms p.158
6.4.1 Bottom-up Agglomerative Methods p.159 Group-based Statistics p.160
6.4.2 Top-down Divisive Methods p.163 Bisecting k-Means p.164
6.5 Probabilistic Model-based Algorithms p.164
6.5.1 Relationship of E M to k-means and other Representative Methods p.167
6.6 Grid-based and Density-based Algorithms p.169
6.6.1 Grid-based Methods p.170
6.6.2 DBSCAN p.172
6.6.3 DENCLUE p.174
6.7 Graph-based Algorithms p.177
6.7.1 Properties of Graph-based Algorithms p.180
6.8 Nonnegative Matrix Factorization p.181
6.8.1 Comparison with Singular Value Decomposition p.185
6.9 Cluster Validation p.186
6.9.1 Internal Validation Criteria p.186 Parameter Tuning with Internal Measures p.188
6.9.2 External Validation Criteria p.189
6.9.3 General Comments p.191
6.10 Summary p.191
6.11 Bibliographic Notes p.192
6.12 Exercises p.192

7 Cluster Analysis: Advanced Concepts

7.1 Introduction p.195
7.2 Clustering Categorical Data p.196
7.2.1 Representative-based Algorithms p.197 k-Modes Clustering p.198 k-Medoids Clustering p.198
7.2.2 Hierarchical Algorithms p.199 ROCK p.199
7.2.3 Probabilistic Algorithms p.200
7.2.4 Graph-based Algorithms p.202
7.3 Scalable Data Clustering p.202
7.3.1 CLARANS p.202
7.3.2 BIRCH p.203
7.3.3 CURE p.205
7.4 High-Dimensional Clustering p.207
7.4.1 CLIQUE p.208
7.4.2 PROCLUS p.209
7.4.3 ORCLUS p.212
7.5 Semisupervised Clustering p.214
7.5.1 Pointwise Supervision p.214
7.5.2 Pairwise Supervision p.215
7.6 Human and Visually Supervised Clustering p.216
7.6.1 Modifications of Existing Clustering Algorithms p.217
7.6.2 Visual Clustering p.217
7.7 Cluster Ensembles p.220
7.7.1 Selecting Different Ensemble Components p.221
7.7.2 Combining Different Ensemble Components p.221 Hypergraph Partitioning Algorithm p.221 Meta-clustering Algorithm p.222
7.8 Putting Clustering to Work: Applications p.222
7.8.1 Applications to Other Data Mining Problems p.222 Data Summarization p.222 Outlier Analysis p.222 Classification p.223 Dimensionality Reduction p.223 Similarity Search and Indexing p.223
7.8.2 Customer Segmentation and Collaborative Filtering p.223
7.8.3 Text Applications p.223
7.8.4 Multimedia Applications p.224
7.8.5 Temporal and Sequence Applications p.224
7.8.6 Social Network Analysis p.224
7.9 Summary p.224
7.10 Bibliographic Notes p.224
7.11 Exercises p.225

8 Outlier Analysis

8.1 Introduction p.227
8.2 Extreme Value Analysis p.229
8.2.1 Univariate Extreme Value Analysis p.230
8.2.2 Multivariate Extreme Values p.231
8.2.3 Depth-based Methods p.233
8.3 Probabilistic Models p.234
8.4 Clustering for Outlier Detection p.236
8.5 Distance-based Outlier Detection p.238
8.5.1 Pruning Methods p.239 Sampling Methods p.239 Early Termination Trick with Nested Loops p.239
8.5.2 Local Distance Correction Methods p.240 Local Outlier Factor ( LOF) p.242 Instance-specific Mahalanobis Distance p.243
8.6 Density-based Methods p.244
8.6.1 Histogram- and Grid-based Techniques p.244
8.6.2 Kernel Density Estimation p.245
8.7 Information-Theoretic Models p.246
8.8 Outlier Validity p.247
8.8.1 Methodological Challenges p.248
8.8.2 Receiver Operating Characteristic p.249
8.8.3 Common Mistakes p.250
8.9 Summary p.250
8.10 Bibliographic Notes p.251
8.11 Exercises p.251

9 Outlier Analysis: Advanced Concepts

9.1 Introduction p.253
9.2 Outlier Detection with Categorical Data p.254
9.2.1 Probabilistic Models p.254
9.2.2 Clustering and Distance-based Methods p.255
9.2.3 Binary and Set-Valued Data p.256
9.3 High-Dimensional Outlier Detection p.256
9.3.1 Grid-based Rare Subspace Exploration p.258 Modeling Abnormal Lower Dimensional Projections p.258 Grid Search for Subspace Outliers p.259
9.3.2 Random Subspace Sampling p.261
9.4 Outlier Ensembles p.262
9.4.1 Categorization by Component Independence p.263 Sequential Ensembles p.263 Independent Ensembles p.264
9.4.2 Categorization by Constituent Components p.264 Model-centered Ensembles p.265 Data-centered Ensembles p.265
9.4.3 Normalization and Combination p.265
9.5 Putting Outliers to Work: Applications p.267
9.5.1 Quality Control and Fault Detection p.267
9.5.2 Financial Fraud and Anomalous Events p.267
9.5.3 Web Log Analytics p.268
9.5.4 Intrusion Detection Applications p.268
9.5.5 Biological and Medical Applications p.268
9.5.6 Earth Science Applications p.268
9.6 Summary p.269
9.7 Bibliographic Notes p.269
9.8 Exercises p.270

10 Data Classification

10.1 Introduction p.271
10.2 Feature Selection for Classification p.273
10.2.1 Filter Models p.274 Gini Index p.274 Entropy p.275 Fisher Score p.275 Fisher’s Linear Discriminant p.276
10.2.2 Wrapper Models p.277
10.2.3 Embedded Models p.278
10.3 Decision Trees p.278
10.3.1 Split Criteria p.281
10.3.2 Stopping Criterion and Pruning p.283
10.3.3 Practical Issues p.284
10.4 Rule-based Classifiers p.284
10.4.1 Rule Generation from Decision Trees p.286
10.4.2 Sequential Covering Algorithms p.287 Learn-One-Rule p.287
10.4.3 Rule Pruning p.290
10.4.4 Associative Classifiers p.290
10.5 Probabilistic Classifiers p.291
10.5.1 Naive Bayes Classifier p.291 The Ranking Model for Classification p.294 Discussion of the Naive Assumption p.295
10.5.2 Logistic Regression p.295 Training a Logistic Regression Classifier p.297 Relationship with Other Linear Models p.298
10.6 Support Vector Machines p.298
10.6.1 Support Vector Machines for Linearly Separable Data p.298 Solving the Lagrangian Dual p.303
10.6.2 Support Vector Machines with Soft Margin for Nonseparable Data p.304 Comparison with other Linear Models p.306
10.6.3 Nonlinear Support Vector Machines p.306
10.6.4 The Kernel Trick p.307 Other Applications of Kernel Methods p.309
10.7 Neural Networks p.311
10.7.1 Single-Layer Neural Network: The Perceptron p.311
10.7.2 Multilayer Neural Networks p.313
10.7.3 Comparing Various Linear Models p.315
10.8 Instance-based Learning p.316
10.8.1 Design Variations of Nearest Neighbor Classifiers p.316 Unsupervised Mahalanobis Metric p.317 Nearest Neighbors with Linear Discriminant Analysis p.317
10.9 Classifier Evaluation p.319
10.9.1 Methodological Issues p.319 Holdout p.320 Cross-Validation p.320 Bootstrap p.321
10.9.2 Quantification Issues p.322 Output as Class Labels p.322 Output as Numerical Score p.323
10.10 Summary p.326
10.11 Bibliographic Notes p.326
10.12 Exercises p.327

11 Data Classification: Advanced Concepts

11.1 Introduction p.331
11.2 Multiclass Learning p.332
11.3 Rare Class Learning p.333
11.3.1 Example Re-weighting p.334
11.3.2 Sampling Methods p.335 Relationship between Weighting and Sampling p.336 Synthetic Over-sampling: SMOTE p.336
11.4 Scalable Classification p.336
11.4.1 Scalable Decision Trees p.337 Rain Forest p.337 BOAT. p.337
11.4.2 Scalable Support Vector Machines p.337
11.5 Regression Modeling with Numeric Classes p.339
11.5.1 Linear Regression. p.339 Relationship with Fisher’s Linear Discriminant p.341
11.5.2 Principal Component Regression p.342
11.5.3 Generalized Linear Models p.343
11.5.4 Nonlinear and Polynomial Regression p.344
11.5.5 From Decision Trees to Regression Trees p.345
11.5.6 Assessing Model Effectiveness p.346
11.6 Semisupervised Learning p.346
11.6.1 Generic Meta-Algorithms p.348 Self-Training p.348 Co-Training p.348
11.6.2 Specific Variations of Classification Algorithms p.349 Semisupervised Bayes Classification with E M p.349 Transductive Support Vector Machines p.351
11.6.3 Graph-based Semisupervised Learning p.352
11.6.4 Discussion of Semisupervised Learning p.353
11.7 Active Learning p.353
11.7.1 Heterogeneity-based Models p.355 Uncertainty Sampling p.355 Query-by-Committee p.356 Expected Model Change p.356
11.7.2 Performance-based Models p.357 Expected Error Reduction p.357 Expected Variance Reduction p.358
11.7.3 Representativeness-based Models p.358
11.8 Ensemble Methods p.358
11.8.1 Why does Ensemble Analysis Work? p.360
11.8.2 Formal Statement of Bias-Variance Trade-off p.362
11.8.3 Specific Instantiations of Ensemble Learning p.364 Bagging p.364 Random Forests p.365 Boosting p.366 Bucket of Models p.368 Stacking p.368
11.9 Summary p.369
11.10 Bibliographic Notes p.370
11.11 Exercises p.370

12 Mining Data Streams

12.1 Introduction p.373
12.2 Synopsis Data Structures for Streams p.375
12.2.1 Reservoir Sampling p.375 Handling Concept Drift p.376 Useful Theoretical Bounds for Sampling p.377
12.2.2 Synopsis Structures for the Massive-Domain Scenario p.381 Bloom Filter p.382 Count-Min Sketch p.386 AMSSketch p.389 Flajolet-Martin Algorithm for Distinct Element Counting p.391
12.3 Frequent Pattern Mining in Data Streams p.392
12.3.1 Leveraging Synopsis Structures p.392 Reservoir Sampling p.393 Sketches p.393
12.3.2 Lossy Counting Algorithm. p.393
12.4 Clustering Data Streams p.394
12.4.1 STREAMAlgorithm p.394
12.4.2 Clu Stream Algorithm p.396 Micro-cluster Definition p.396 Micro-clustering Algorithm p.397 Pyramidal Time Frame p.398
12.4.3 Massive-Domain Stream Clustering p.400
12.5 Streaming Outlier Detection p.400
12.5.1 Individual Data Points as Outliers p.401
12.5.2 Aggregate Change Points as Outliers p.402
12.6 Streaming Classification p.404
12.6.1 VFDTFamily p.404
12.6.2 Supervised Micro-cluster Approach p.406
12.6.3 Ensemble Method p.407
12.6.4 Massive-Domain Streaming Classification p.407
12.7 Summary p.408
12.8 Bibliographic Notes p.408
12.9 Exercises p.408

13 Mining Text Data

13.1 Introduction p.411
13.2 Document Preparation and Similarity Computation p.413
13.2.1 Document Normalization and Similarity Computation p.413
13.2.2 Specialized Preprocessing for Web Documents p.415
13.3 Specialized Clustering Methods for Text p.416
13.3.1 Representative-based Algorithms p.416 Scatter/ Gather Approach p.416
13.3.2 Probabilistic Algorithms p.418
13.3.3 Simultaneous Document and Word Cluster Discovery p.419 Co-clustering p.420
13.4 Topic Modeling p.422
13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis p.425
13.4.2 Use in Clustering and Comparison with Probabilistic Clustering p.427
13.4.3 Limitations of PLSA p.427
13.5 Specialized Classification Methods for Text p.428
13.5.1 Instance-based Classifiers p.428 Leveraging Latent Semantic Analysis p.428 Centroid-based Classification p.429 Rocchio Classification p.429
13.5.2 Bayes Classifiers p.430 Multinomial Bayes Model p.430
13.5.3 SVMClassifiers for High-dimensional and Sparse Data p.432
13.6 Novelty and First-Story Detection p.434
13.6.1 Micro-clustering Method p.435
13.7 Summary p.435
13.8 Bibliographic Notes p.436
13.9 Exercises p.436

14 Mining Time-Series Data

14.1 Introduction p.439
14.2 Time-Series Preparation and Similarity. p.441
14.2.1 Handling Missing Values p.441
14.2.2 Noise Removal p.441
14.2.3 Normalization p.443
14.2.4 Data Transformation and Reduction p.444 Discrete Wavelet Transform. p.444 Discrete Fourier Transform p.444 Symbolic Aggregate Approximation ( SAX) p.445
14.2.5 Time-Series Similarity Measures p.446
14.3 Time-Series Forecasting p.446
14.3.1 Autoregressive Models p.448
14.3.2 Autoregressive Moving Average Models p.450
14.3.3 Multivariate Forecasting with Hidden Variables p.451
14.4 Time-Series Motifs p.453
14.4.1 Distance-based Motifs p.454
14.4.2 Transformation to Sequential Pattern Mining p.456
14.4.3 Periodic Patterns p.457
14.5 Time-Series Clustering p.458
14.5.1 Online Clustering of Co-evolving Series p.458
14.5.2 Shape-based Clustering p.460 k-Means p.461 k-Medoids p.462 Hierarchical Methods p.462 Graph-based Methods p.462
14.6 Time-Series Outlier Detection p.462
14.6.1 Point Outliers p.463
14.6.2 Shape Outliers p.464
14.7 Time-Series Classification p.465
14.7.1 Supervised Event Detection p.466
14.7.2 Whole-Series Classification p.468 Wavelet-based Rules p.469 Nearest Neighbor Classifier p.469 Graph-based Methods p.470
14.8 Summary p.470
14.9 Bibliographic Notes p.470
14.10 Exercises p.471

15 Mining Discrete Sequences

15.1 Introduction p.473
15.2 Sequential Pattern Mining p.474
15.2.1 Frequent Patterns to Frequent Sequences p.477
15.2.2 Constrained Sequential Pattern Mining p.480
15.3 Sequence Clustering p.481
15.3.1 Distance-based Methods p.482
15.3.2 Graph-based Methods p.482
15.3.3 Subsequence-based Clustering p.482
15.3.4 Probabilistic Clustering p.483 Markovian Similarity-based Algorithm: CLUSEQ p.483 Mixture of Hidden Markov Models p.486
15.4 Outlier Detection in Sequences p.487
15.4.1 Position Outliers p.487 Efficiency Issues: Probabilistic Suffix Trees p.490
15.4.2 Combination Outliers p.491 Distance-based Models p.492 Frequency-based Models p.493
15.5 Hidden Markov Models p.494
15.5.1 Formal Definition and Techniques for HMMs p.496
15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence p.497
15.5.3 Explanation: Determining the Most Likely State Sequence for Observed Sequence p.498
15.5.4 Training: Baum-Welch Algorithm p.499
15.5.5 Applications p.500
15.6 Sequence Classification p.501
15.6.1 Nearest Neighbor Classifier p.501
15.6.2 Graph-based Methods p.501
15.6.3 Rule-based Methods p.502
15.6.4 Kernel Support Vector Machines p.503 Bag-of-Words Kernel p.503 Spectrum Kernel p.503 Weighted Degree Kernel p.504
15.6.5 Probabilistic Methods: Hidden Markov Models p.504
15.7 Summary p.505
15.8 Bibliographic Notes p.505
15.9 Exercises p.506

16 Mining Spatial Data

16.1 Introduction p.509
16.2 Mining with Contextual Spatial Attributes p.510
16.2.1 Shape to Time-Series Transformation p.512
16.2.2 Spatial to Multidimensional Transformation with Wavelets p.515
16.2.3 Spatial Co-location Patterns p.516
16.2.4 Clustering Shapes p.517
16.2.5 Outlier Detection p.518 Point Outliers p.518 Shape Outliers p.520
16.2.6 Classification of Shapes p.521
16.3 Trajectory Mining p.522
16.3.1 Equivalence of Trajectories and Multivariate Time Series p.522
16.3.2 Converting Trajectories to Multidimensional Data p.523
16.3.3 Trajectory Pattern Mining p.524 Frequent Trajectory Paths p.524 Co-location Patterns p.526
16.3.4 Trajectory Clustering p.526 Computing Similarity between Trajectories p.526 Similarity-based Clustering Methods p.527 Trajectory Clustering as a Sequence Clustering Problem p.528
16.3.5 Trajectory Outlier Detection p.528 Distance-based Methods p.529 Sequence-based Methods p.529
16.3.6 Trajectory Classification p.530 Distance-based Methods p.530 Sequence-based Methods p.530
16.4 Summary p.531
16.5 Bibliographic Notes p.531
16.6 Exercises p.532

17 Mining Graph Data

17.1 Introduction p.533
17.2 Matching and Distance Computation in Graphs p.535
17.2.1 Ullman’s Algorithm for Subgraph Isomorphism p.537 Algorithm Variations and Refinements p.539
17.2.2 Maximum Common Subgraph Problem p.540
17.2.3 Graph Matching Methods for Distance Computation p.541 Maximum Common Subgraph-based Distances p.541 Graph Edit Distance p.542
17.3 Transformation-based Distance Computation p.546
17.3.1 Frequent Substructure-based Transformation and Distance Computation p.546
17.3.2 Topological Descriptors p.547
17.3.3 Kernel-based Transformations and Computation p.549 Random-Walk Kernels p.549 Shortest-Path Kernels p.550
17.4 Frequent Substructure Mining in Graphs p.550
17.4.1 Node-based Join Growth p.552
17.4.2 Edge-based Join Growth p.554
17.4.3 Frequent Pattern Mining to Graph Pattern Mining p.554
17.5 Graph Clustering p.554
17.5.1 Distance-based Methods p.555
17.5.2 Frequent Substructure-based Methods p.555 Generic Transformational Approach p.556 X Proj: Direct Clustering with Frequent Subgraph Discovery 556
17.6 Graph Classification p.558
17.6.1 Distance-based Methods p.558
17.6.2 Frequent Substructure-based Methods p.558 Generic Transformational Approach p.559 X Rules: A Rule-based Approach p.559
17.6.3 Kernel Support Vector Machines p.560
17.7 Summary p.560
17.8 Bibliographic Notes p.561
17.9 Exercises p.562

18 Mining Web Data

18.1 Introduction p.565
18.2 Web Crawling and Resource Discovery p.567
18.2.1 A Basic Crawler Algorithm p.567
18.2.2 Preferential Crawlers p.569
18.2.3 Multiple Threads p.569
18.2.4 Combatting Spider Traps p.569
18.2.5 Shingling for Near Duplicate Detection p.570
18.3 Search Engine Indexing and Query Processing p.570
18.4 Ranking Algorithms p.573
18.4.1 Page Rank p.573 Topic-Sensitive Page Rank p.576 Sim Rank p.577
18.4.2 HITS p.578
18.5 Recommender Systems p.579
18.5.1 Content-based Recommendations p.581
18.5.2 Neighborhood-based Methods for Collaborative Filtering p.582 User-based Similarity with Ratings p.582 Item-based Similarity with Ratings p.583
18.5.3 Graph-based Methods p.583
18.5.4 Clustering Methods p.585 Adapting k-Means Clustering p.585 Adapting Co-Clustering p.585
18.5.5 Latent Factor Models p.586 Singular Value Decomposition p.587 Matrix Factorization p.587
18.6 Web Usage Mining p.588
18.6.1 Data Preprocessing p.589
18.6.2 Applications p.589
18.7 Summary p.590
18.8 Bibliographic Notes p.591
18.9 Exercises p.591

19 Social Network Analysis

19.1 Introduction p.593
19.2 Social Networks: Preliminaries and Properties p.594
19.2.1 Homophily p.595
19.2.2 Triadic Closure and Clustering Coefficient p.595
19.2.3 Dynamics of Network Formation p.595
19.2.4 Power-Law Degree Distributions p.597
19.2.5 Measures of Centrality and Prestige p.597 Degree Centrality and Prestige p.597 Closeness Centrality and Proximity Prestige p.598 Betweenness Centrality p.600 Rank Centrality and Prestige p.600
19.3 Community Detection p.601
19.3.1 Kernighan-Lin Algorithm p.602 Speeding up Kernighan-Lin p.604
19.3.2 Girvan-Newman Algorithm p.604
19.3.3 Multilevel Graph Partitioning: METIS p.607
19.3.4 Spectral Clustering p.610 Important Observations and Intuitions p.613
19.4 Collective Classification p.614
19.4.1 Iterative Classification Algorithm p.615
19.4.2 Label Propagation with Random Walks p.616 Iterative Label Propagation: The Spectral Interpretation p.619
19.4.3 Supervised Spectral Methods p.619 Supervised Feature Generation with Spectral Embedding p.620 Graph Regularization Approach p.620 Connections with Random-Walk Methods p.622
19.5 Link Prediction p.623
19.5.1 Neighborhood-based Measures p.623
19.5.2 Katz Measure p.624
19.5.3 Random Walk-based Measures p.625
19.5.4 Link Prediction as a Classification Problem p.626
19.5.5 Link Prediction as a Missing Value Estimation Problem p.627
19.5.6 Discussion p.627
19.6 Social Influence Analysis p.627
19.6.1 Linear Threshold Model p.629
19.6.2 Independent Cascade Model p.629
19.6.3 Influence Function Evaluation p.630
19.7 Summary p.630
19.8 Bibliographic Notes p.631
19.9 Exercises p.632

20 Privacy-Preserving Data Mining

20.1 Introduction p.635
20.2 Privacy during Data Collection p.636
20.2.1 Reconstructing Aggregate Distributions p.637
20.2.2 Leveraging Aggregate Distributions for Data Mining p.639
20.3 Privacy-Preserving Data Publishing p.639
20.3.1 The k-anonymity Model p.641 Samarati’s Algorithm p.645 Incognito p.646 Mondrian Multidimensional k-Anonymity p.649 Synthetic Data Generation: Condensation-based Approach 651
20.3.2 The ?-diversity Model p.653
20.3.3 The t-closeness Model p.655
20.3.4 The Curse of Dimensionality p.658
20.4 Output Privacy p.658
20.5 Distributed Privacy p.659
20.6 Summary p.661
20.7 Bibliographic Notes p.661
20.8 Exercises p.663



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2015 DataMiningTheTextbookCharu C. AggarwalData Mining - The Textbook2015
AuthorCharu Aggarwal +
titleData Mining - The Textbook +
year2015 +