# 2015 Data Mining: The Textbook

## Quotes

### Abstract

This textbook explores the different aspects of data mining, from the fundamentals to complex data types and their applications, capturing the wide diversity of problem domains for data mining issues. It goes beyond the traditional focus on data mining problems to introduce advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. Until now, no single book has addressed all these topics in a comprehensive and integrated way. The chapters of this book fall into one of three categories:

- Fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems.
- Domain chapters: These chapters discuss the specific methods used for different domains of data, such as text data, time-series data, sequence data, graph data, and spatial data.
- Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor.

Appropriate for both introductory and advanced data mining courses, Data Mining: The Textbook balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners (including those with a limited mathematical background). Numerous illustrations, examples, and exercises are included, with an emphasis on semantically interpretable examples.

### 1 An Introduction to Data Mining

```1.1 Introduction p.1
1.2 The Data Mining Process p.3
1.2.1 The Data Preprocessing Phase p.5
1.2.2 The Analytical Phase p.6
1.3 The Basic Data Types p.6
1.3.1 Non-dependency Oriented Data p.6
1.3.1.1 Quantitative Multidimensional Data p.7
1.3.1.2 Categorical and Mixed Attribute Data p.7
1.3.1.3 Binary and Set Data p.8
1.3.1.4 Text Data p.8
1.3.2 Dependency Oriented Data p.9
1.3.2.1 Time-Series Data p.9
1.3.2.2 Discrete Sequences and Strings p.10
1.3.2.3 Spatial Data p.11
1.3.2.4 Network and Graph Data p.12
1.4 The Major Building Blocks: A Bird’s Eye View p.13
1.4.1 Association Pattern Mining p.14
1.4.2 Data Clustering p.15
1.4.3 Outlier Detection p.16
1.4.4 Data Classification p.17
1.4.5 Impact of Complex Data Types on Problem Definitions p.18
1.4.5.1 Pattern Mining with Complex Data Types p.18
1.4.5.2 Clustering with Complex Data Types p.19
1.4.5.3 Outlier Detection with Complex Data Types p.19
1.4.5.4 Classification with Complex Data Types p.20
1.5 Scalability Issues and the Streaming Scenario p.20
1.6 A Stroll through some Application Scenarios p.21
1.6.1 Store Product Placement p.21
1.6.2 Customer Recommendations p.21
1.6.3 Medical Diagnosis p.22
1.6.4 Web Log Anomalies p.22
1.7 Summary p.23
1.8 Bibliographic Notes p.23
1.9 Exercises p.24
```

### 2 Data Preparation

```2.1 Introduction p.25
2.2 Feature Extraction and Portability p.26
2.2.1 Feature Extraction p.26
2.2.2 Data Type Portability p.27
2.2.2.1 Numeric to Categorical Data: Discretization p.28
2.2.2.2 Categorical to Numeric Data: Binarization p.29
2.2.2.3 Text to Numeric Data p.29
2.2.2.4 Time Series to Discrete Sequence Data p.30
2.2.2.5 Time Series to Numeric Data p.30
2.2.2.6 Discrete Sequence to Numeric Data p.30
2.2.2.7 Spatial to Numeric Data p.31
2.2.2.8 Graphs to Numeric Data p.31
2.2.2.9 Any Type to Graphs for Similarity-based Applications p.31
2.3 Data Cleaning p.32
2.3.1 Handling Missing Entries p.33
2.3.2 Handling Incorrect and Inconsistent Entries p.33
2.3.3 Scaling and Normalization p.34
2.4 Data Reduction and Transformation p.35
2.4.1 Sampling p.35
2.4.1.1 Sampling for Static Data p.36
2.4.1.2 Reservoir Sampling for Data Streams p.36
2.4.2 Feature Subset Selection p.38
2.4.3 Dimensionality Reduction with Axis Rotation p.38
2.4.3.1 Principal Component Analysis p.39
2.4.3.2 Singular Value Decomposition p.41
2.4.3.3 Latent Semantic Analysis p.45
2.4.3.4 Applications of PCA and SVD p.45
2.4.4 Dimensionality Reduction with Type Transformation p.46
2.4.4.1 Haar Wavelet Transform p.47
2.4.4.2 Multidimensional Scaling p.52
2.4.4.3 Spectral Transformation and Embedding of Graphs p.54
2.5 Summary p.56
2.6 Bibliographic Notes p.57
2.7 Exercises p.57
```

### 3 Similarity and Distances

```3.1 Introduction p.59
3.2 Multidimensional Data p.60
3.2.1 Quantitative Data p.60
3.2.1.1 Impact of Domain-specific Relevance p.61
3.2.1.2 Impact of High Dimensionality p.61
3.2.1.3 Impact of Locally Irrelevant Features p.62
3.2.1.4 Impact of Different Lp-norms p.63
3.2.1.5 Match-based Similarity Computation p.64
3.2.1.6 Impact of Data Distribution p.65
3.2.1.7 Nonlinear Distributions: ISOMAP p.66
3.2.1.8 Impact of Local Data Distribution p.67
3.2.1.9 Computational Considerations p.69
3.2.2 Categorical Data p.69
3.2.3 Mixed Quantitative and Categorical Data p.70
3.3 Text Similarity Measures p.71
3.3.1 Binary and Set Data p.72
3.4 Temporal Similarity Measures p.72
3.4.1 Time-Series Similarity Measures p.73
3.4.1.1 Impact of Behavioral Attribute Normalization p.74
3.4.1.2 Lp-norm p.74
3.4.1.3 Dynamic Time Warping Distance p.74
3.4.1.4 Window-based Methods p.77
3.4.2 Discrete Sequence Similarity Measures p.77
3.4.2.1 Edit Distance p.77
3.4.2.2 Longest Common Subsequence p.79
3.5 Graph Similarity Measures p.80
3.5.1 Similarity between Two Nodes in a Single Graph p.80
3.5.1.1 Structural Distance-based Measure p.80
3.5.1.2 Random Walk-based Similarity p.81
3.5.2 Similarity between Two Graphs p.81
3.6 Supervised Similarity Functions p.82
3.7 Summary p.83
3.8 Bibliographic Notes p.84
3.9 Exercises p.85
```

### 4 Association Pattern Mining

```4.1 Introduction p.87
4.2 The Frequent Pattern Mining Model p.88
4.3 Association Rule Generation Framework p.91
4.4 Frequent Itemset Mining Algorithms p.92
4.4.1 Brute Force Algorithms p.93
4.4.2 The Apriori Algorithm p.94
4.4.2.1 Efficient Support Counting p.95
4.4.3 Enumeration-Tree Algorithms p.96
4.4.3.1 Enumeration-Tree-based Interpretation of Apriori p.99
4.4.3.2 Tree Projection and Depth Project p.99
4.4.3.3 Vertical Counting Methods p.104
4.4.4 Recursive Suffix-based Pattern Growth Methods p.106
4.4.4.1 Implementation with Arrays but no Pointers p.107
4.4.4.2 Implementation with Pointers but no FP-Tree p.108
4.4.4.3 Implementation with Pointers and FP-Tree p.109
4.4.4.4 Trade-offs with Different Data Structures p.112
4.4.4.5 Relationship between FP-growth and Enumeration-Tree Methods p.113
4.5 Alternative Models: Interesting Patterns p.115
4.5.1 Statistical Coefficient of Correlation p.116
4.5.2 χ2 Measure p.116
4.5.3 Interest Ratio p.117
4.5.4 Symmetric Confidence Measures p.117
4.5.5 Cosine Coefficient on Columns p.118
4.5.6 Jaccard Coefficient and the Min-hash Trick p.118
4.5.7 Collective Strength p.119
4.5.8 Relationship to Negative Pattern Mining p.120
4.6 Useful Meta-Algorithms p.120
4.6.1 Sampling Methods p.120
4.6.2 Data Partitioned Ensembles p.121
4.6.3 Generalization to Other Data Types p.121
4.6.3.1 Quantitative Data p.122
4.6.3.2 Categorical Data p.122
4.7 Summary p.122
4.8 Bibliographic Notes p.123
4.9 Exercises p.124
```

### 5 Association Pattern Mining: Advanced Concepts

```5.1 Introduction p.127
5.2 Pattern Summarization p.128
5.2.1 Maximal Patterns p.128
5.2.2 Closed Patterns p.129
5.2.3 Approximate Frequent Patterns p.131
5.2.3.1 Approximation in Terms of Transactions p.131
5.2.3.2 Approximation in Terms of Itemsets p.132
5.3 Pattern Querying p.133
5.3.1.1 Leveraging the Itemset Lattice p.133
5.3.1.2 Leveraging Data Structures for Querying p.134
5.3.2 Pushing Constraints into Pattern Mining p.138
5.4 Putting Associations to Work: Applications p.139
5.4.1 Relationship to Other Data Mining Problems p.139
5.4.1.1 Application to Classification p.139
5.4.1.2 Application to Clustering p.139
5.4.1.3 Applications to Outlier Detection p.139
5.4.3 Demographic and Profile Analysis p.140
5.4.4 Recommendations and Collaborative Filtering p.140
5.4.5 Web Log Analysis p.140
5.4.6 Bioinformatics p.141
5.4.7 Other Applications for Complex Data Types p.141
5.5 Summary p.141
5.6 Bibliographic Notes p.142
5.7 Exercises p.143
```

### 6 Cluster Analysis

```6.1 Introduction p.145
6.2 Feature Selection for Clustering p.146
6.2.1 Filter Models p.147
6.2.1.1 Term Strength p.147
6.2.1.2 Predictive Attribute Dependence p.147
6.2.1.3 Entropy p.148
6.2.1.4 Hopkins Statistic p.149
6.2.2 Wrapper Models p.150
6.3 Representative-based Algorithms p.151
6.3.1 The k-Means Algorithm p.154
6.3.2 The Kernel k-Means Algorithm p.155
6.3.3 The k-Medians Algorithm p.155
6.3.4 The k-Medoids Algorithm p.155
6.4 Hierarchical Clustering Algorithms p.158
6.4.1 Bottom-up Agglomerative Methods p.159
6.4.1.1 Group-based Statistics p.160
6.4.2 Top-down Divisive Methods p.163
6.4.2.1 Bisecting k-Means p.164
6.5 Probabilistic Model-based Algorithms p.164
6.5.1 Relationship of EM to k-means and other Representative Methods p.167
6.6 Grid-based and Density-based Algorithms p.169
6.6.1 Grid-based Methods p.170
6.6.2 DBSCAN p.172
6.6.3 DENCLUE p.174
6.7 Graph-based Algorithms p.177
6.7.1 Properties of Graph-based Algorithms p.180
6.8 Nonnegative Matrix Factorization p.181
6.8.1 Comparison with Singular Value Decomposition p.185
6.9 Cluster Validation p.186
6.9.1 Internal Validation Criteria p.186
6.9.1.1 Parameter Tuning with Internal Measures p.188
6.9.2 External Validation Criteria p.189
6.10 Summary p.191
6.11 Bibliographic Notes p.192
6.12 Exercises p.192
```

### 7 Cluster Analysis: Advanced Concepts

```7.1 Introduction p.195
7.2 Clustering Categorical Data p.196
7.2.1 Representative-based Algorithms p.197
7.2.1.1 k-Modes Clustering p.198
7.2.1.2 k-Medoids Clustering p.198
7.2.2 Hierarchical Algorithms p.199
7.2.2.1 ROCK p.199
7.2.3 Probabilistic Algorithms p.200
7.2.4 Graph-based Algorithms p.202
7.3 Scalable Data Clustering p.202
7.3.1 CLARANS p.202
7.3.2 BIRCH p.203
7.3.3 CURE p.205
7.4 High-Dimensional Clustering p.207
7.4.1 CLIQUE p.208
7.4.2 PROCLUS p.209
7.4.3 ORCLUS p.212
7.5 Semisupervised Clustering p.214
7.5.1 Pointwise Supervision p.214
7.5.2 Pairwise Supervision p.215
7.6 Human and Visually Supervised Clustering p.216
7.6.1 Modifications of Existing Clustering Algorithms p.217
7.6.2 Visual Clustering p.217
7.7 Cluster Ensembles p.220
7.7.1 Selecting Different Ensemble Components p.221
7.7.2 Combining Different Ensemble Components p.221
7.7.2.1 Hypergraph Partitioning Algorithm p.221
7.7.2.2 Meta-clustering Algorithm p.222
7.8 Putting Clustering to Work: Applications p.222
7.8.1 Applications to Other Data Mining Problems p.222
7.8.1.1 Data Summarization p.222
7.8.1.2 Outlier Analysis p.222
7.8.1.3 Classification p.223
7.8.1.4 Dimensionality Reduction p.223
7.8.1.5 Similarity Search and Indexing p.223
7.8.2 Customer Segmentation and Collaborative Filtering p.223
7.8.3 Text Applications p.223
7.8.4 Multimedia Applications p.224
7.8.5 Temporal and Sequence Applications p.224
7.8.6 Social Network Analysis p.224
7.9 Summary p.224
7.10 Bibliographic Notes p.224
7.11 Exercises p.225
```

### 8 Outlier Analysis

```8.1 Introduction p.227
8.2 Extreme Value Analysis p.229
8.2.1 Univariate Extreme Value Analysis p.230
8.2.2 Multivariate Extreme Values p.231
8.2.3 Depth-based Methods p.233
8.3 Probabilistic Models p.234
8.4 Clustering for Outlier Detection p.236
8.5 Distance-based Outlier Detection p.238
8.5.1 Pruning Methods p.239
8.5.1.1 Sampling Methods p.239
8.5.1.2 Early Termination Trick with Nested Loops p.239
8.5.2 Local Distance Correction Methods p.240
8.5.2.1 Local Outlier Factor (LOF) p.242
8.5.2.2 Instance-specific Mahalanobis Distance p.243
8.6 Density-based Methods p.244
8.6.1 Histogram- and Grid-based Techniques p.244
8.6.2 Kernel Density Estimation p.245
8.7 Information-Theoretic Models p.246
8.8 Outlier Validity p.247
8.8.1 Methodological Challenges p.248
8.8.3 Common Mistakes p.250
8.9 Summary p.250
8.10 Bibliographic Notes p.251
8.11 Exercises p.251
```

### 9 Outlier Analysis: Advanced Concepts

```9.1 Introduction p.253
9.2 Outlier Detection with Categorical Data p.254
9.2.1 Probabilistic Models p.254
9.2.2 Clustering and Distance-based Methods p.255
9.2.3 Binary and Set-Valued Data p.256
9.3 High-Dimensional Outlier Detection p.256
9.3.1 Grid-based Rare Subspace Exploration p.258
9.3.1.1 Modeling Abnormal Lower Dimensional Projections p.258
9.3.1.2 Grid Search for Subspace Outliers p.259
9.3.2 Random Subspace Sampling p.261
9.4 Outlier Ensembles p.262
9.4.1 Categorization by Component Independence p.263
9.4.1.1 Sequential Ensembles p.263
9.4.1.2 Independent Ensembles p.264
9.4.2 Categorization by Constituent Components p.264
9.4.2.1 Model-centered Ensembles p.265
9.4.2.2 Data-centered Ensembles p.265
9.4.3 Normalization and Combination p.265
9.5 Putting Outliers to Work: Applications p.267
9.5.1 Quality Control and Fault Detection p.267
9.5.2 Financial Fraud and Anomalous Events p.267
9.5.3 Web Log Analytics p.268
9.5.4 Intrusion Detection Applications p.268
9.5.5 Biological and Medical Applications p.268
9.5.6 Earth Science Applications p.268
9.6 Summary p.269
9.7 Bibliographic Notes p.269
9.8 Exercises p.270
```

### 10 Data Classification

```10.1 Introduction p.271
10.2 Feature Selection for Classification p.273
10.2.1 Filter Models p.274
10.2.1.1 Gini Index p.274
10.2.1.2 Entropy p.275
10.2.1.3 Fisher Score p.275
10.2.1.4 Fisher’s Linear Discriminant p.276
10.2.2 Wrapper Models p.277
10.2.3 Embedded Models p.278
10.3 Decision Trees p.278
10.3.1 Split Criteria p.281
10.3.2 Stopping Criterion and Pruning p.283
10.3.3 Practical Issues p.284
10.4 Rule-based Classifiers p.284
10.4.1 Rule Generation from Decision Trees p.286
10.4.2 Sequential Covering Algorithms p.287
10.4.2.1 Learn-One-Rule p.287
10.4.3 Rule Pruning p.290
10.4.4 Associative Classifiers p.290
10.5 Probabilistic Classifiers p.291
10.5.1 Naive Bayes Classifier p.291
10.5.1.1 The Ranking Model for Classification p.294
10.5.1.2 Discussion of the Naive Assumption p.295
10.5.2 Logistic Regression p.295
10.5.2.1 Training a Logistic Regression Classifier p.297
10.5.2.2 Relationship with Other Linear Models p.298
10.6 Support Vector Machines p.298
10.6.1 Support Vector Machines for Linearly Separable Data p.298
10.6.1.1 Solving the Lagrangian Dual p.303
10.6.2 Support Vector Machines with Soft Margin for Nonseparable Data p.304
10.6.2.1 Comparison with other Linear Models p.306
10.6.3 Nonlinear Support Vector Machines p.306
10.6.4 The Kernel Trick p.307
10.6.4.1 Other Applications of Kernel Methods p.309
10.7 Neural Networks p.311
10.7.1 Single-Layer Neural Network: The Perceptron p.311
10.7.2 Multilayer Neural Networks p.313
10.7.3 Comparing Various Linear Models p.315
10.8 Instance-based Learning p.316
10.8.1 Design Variations of Nearest Neighbor Classifiers p.316
10.8.1.1 Unsupervised Mahalanobis Metric p.317
10.8.1.2 Nearest Neighbors with Linear Discriminant Analysis p.317
10.9 Classifier Evaluation p.319
10.9.1 Methodological Issues p.319
10.9.1.1 Holdout p.320
10.9.1.2 Cross-Validation p.320
10.9.1.3 Bootstrap p.321
10.9.2 Quantification Issues p.322
10.9.2.1 Output as Class Labels p.322
10.9.2.2 Output as Numerical Score p.323
10.10 Summary p.326
10.11 Bibliographic Notes p.326
10.12 Exercises p.327
```

### 11 Data Classification: Advanced Concepts

```11.1 Introduction p.331
11.2 Multiclass Learning p.332
11.3 Rare Class Learning p.333
11.3.1 Example Re-weighting p.334
11.3.2 Sampling Methods p.335
11.3.2.1 Relationship between Weighting and Sampling p.336
11.3.2.2 Synthetic Over-sampling: SMOTE p.336
11.4 Scalable Classification p.336
11.4.1 Scalable Decision Trees p.337
11.4.1.1 Rain Forest p.337
11.4.1.2 BOAT p.337
11.4.2 Scalable Support Vector Machines p.337
11.5 Regression Modeling with Numeric Classes p.339
11.5.1 Linear Regression p.339
11.5.1.1 Relationship with Fisher’s Linear Discriminant p.341
11.5.2 Principal Component Regression p.342
11.5.3 Generalized Linear Models p.343
11.5.4 Nonlinear and Polynomial Regression p.344
11.5.5 From Decision Trees to Regression Trees p.345
11.5.6 Assessing Model Effectiveness p.346
11.6 Semisupervised Learning p.346
11.6.1 Generic Meta-Algorithms p.348
11.6.1.1 Self-Training p.348
11.6.1.2 Co-Training p.348
11.6.2 Specific Variations of Classification Algorithms p.349
11.6.2.1 Semisupervised Bayes Classification with EM p.349
11.6.2.2 Transductive Support Vector Machines p.351
11.6.3 Graph-based Semisupervised Learning p.352
11.6.4 Discussion of Semisupervised Learning p.353
11.7 Active Learning p.353
11.7.1 Heterogeneity-based Models p.355
11.7.1.1 Uncertainty Sampling p.355
11.7.1.2 Query-by-Committee p.356
11.7.1.3 Expected Model Change p.356
11.7.2 Performance-based Models p.357
11.7.2.1 Expected Error Reduction p.357
11.7.2.2 Expected Variance Reduction p.358
11.7.3 Representativeness-based Models p.358
11.8 Ensemble Methods p.358
11.8.1 Why does Ensemble Analysis Work? p.360
11.8.2 Formal Statement of Bias-Variance Trade-off p.362
11.8.3 Specific Instantiations of Ensemble Learning p.364
11.8.3.1 Bagging p.364
11.8.3.2 Random Forests p.365
11.8.3.3 Boosting p.366
11.8.3.4 Bucket of Models p.368
11.8.3.5 Stacking p.368
11.9 Summary p.369
11.10 Bibliographic Notes p.370
11.11 Exercises p.370
```

### 12 Mining Data Streams

```12.1 Introduction p.373
12.2 Synopsis Data Structures for Streams p.375
12.2.1 Reservoir Sampling p.375
12.2.1.1 Handling Concept Drift p.376
12.2.1.2 Useful Theoretical Bounds for Sampling p.377
12.2.2 Synopsis Structures for the Massive-Domain Scenario p.381
12.2.2.1 Bloom Filter p.382
12.2.2.2 Count-Min Sketch p.386
12.2.2.3 AMS Sketch p.389
12.2.2.4 Flajolet-Martin Algorithm for Distinct Element Counting p.391
12.3 Frequent Pattern Mining in Data Streams p.392
12.3.1 Leveraging Synopsis Structures p.392
12.3.1.1 Reservoir Sampling p.393
12.3.1.2 Sketches p.393
12.3.2 Lossy Counting Algorithm p.393
12.4 Clustering Data Streams p.394
12.4.1 STREAM Algorithm p.394
12.4.2 CluStream Algorithm p.396
12.4.2.1 Micro-cluster Definition p.396
12.4.2.2 Micro-clustering Algorithm p.397
12.4.2.3 Pyramidal Time Frame p.398
12.4.3 Massive-Domain Stream Clustering p.400
12.5 Streaming Outlier Detection p.400
12.5.1 Individual Data Points as Outliers p.401
12.5.2 Aggregate Change Points as Outliers p.402
12.6 Streaming Classification p.404
12.6.1 VFDT Family p.404
12.6.2 Supervised Micro-cluster Approach p.406
12.6.3 Ensemble Method p.407
12.6.4 Massive-Domain Streaming Classification p.407
12.7 Summary p.408
12.8 Bibliographic Notes p.408
12.9 Exercises p.408
```

### 13 Mining Text Data

```13.1 Introduction p.411
13.2 Document Preparation and Similarity Computation p.413
13.2.1 Document Normalization and Similarity Computation p.413
13.2.2 Specialized Preprocessing for Web Documents p.415
13.3 Specialized Clustering Methods for Text p.416
13.3.1 Representative-based Algorithms p.416
13.3.1.1 Scatter/Gather Approach p.416
13.3.2 Probabilistic Algorithms p.418
13.3.3 Simultaneous Document and Word Cluster Discovery p.419
13.3.3.1 Co-clustering p.420
13.4 Topic Modeling p.422
13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis p.425
13.4.2 Use in Clustering and Comparison with Probabilistic Clustering p.427
13.4.3 Limitations of PLSA p.427
13.5 Specialized Classification Methods for Text p.428
13.5.1 Instance-based Classifiers p.428
13.5.1.1 Leveraging Latent Semantic Analysis p.428
13.5.1.2 Centroid-based Classification p.429
13.5.1.3 Rocchio Classification p.429
13.5.2 Bayes Classifiers p.430
13.5.2.1 Multinomial Bayes Model p.430
13.5.3 SVM Classifiers for High-dimensional and Sparse Data p.432
13.6 Novelty and First-Story Detection p.434
13.6.1 Micro-clustering Method p.435
13.7 Summary p.435
13.8 Bibliographic Notes p.436
13.9 Exercises p.436
```

### 14 Mining Time-Series Data

```14.1 Introduction p.439
14.2 Time-Series Preparation and Similarity p.441
14.2.1 Handling Missing Values p.441
14.2.2 Noise Removal p.441
14.2.3 Normalization p.443
14.2.4 Data Transformation and Reduction p.444
14.2.4.1 Discrete Wavelet Transform p.444
14.2.4.2 Discrete Fourier Transform p.444
14.2.4.3 Symbolic Aggregate Approximation (SAX) p.445
14.2.5 Time-Series Similarity Measures p.446
14.3 Time-Series Forecasting p.446
14.3.1 Autoregressive Models p.448
14.3.2 Autoregressive Moving Average Models p.450
14.3.3 Multivariate Forecasting with Hidden Variables p.451
14.4 Time-Series Motifs p.453
14.4.1 Distance-based Motifs p.454
14.4.2 Transformation to Sequential Pattern Mining p.456
14.4.3 Periodic Patterns p.457
14.5 Time-Series Clustering p.458
14.5.1 Online Clustering of Co-evolving Series p.458
14.5.2 Shape-based Clustering p.460
14.5.2.1 k-Means p.461
14.5.2.2 k-Medoids p.462
14.5.2.3 Hierarchical Methods p.462
14.5.2.4 Graph-based Methods p.462
14.6 Time-Series Outlier Detection p.462
14.6.1 Point Outliers p.463
14.6.2 Shape Outliers p.464
14.7 Time-Series Classification p.465
14.7.1 Supervised Event Detection p.466
14.7.2 Whole-Series Classification p.468
14.7.2.1 Wavelet-based Rules p.469
14.7.2.2 Nearest Neighbor Classifier p.469
14.7.2.3 Graph-based Methods p.470
14.8 Summary p.470
14.9 Bibliographic Notes p.470
14.10 Exercises p.471
```

### 15 Mining Discrete Sequences

```15.1 Introduction p.473
15.2 Sequential Pattern Mining p.474
15.2.1 Frequent Patterns to Frequent Sequences p.477
15.2.2 Constrained Sequential Pattern Mining p.480
15.3 Sequence Clustering p.481
15.3.1 Distance-based Methods p.482
15.3.2 Graph-based Methods p.482
15.3.3 Subsequence-based Clustering p.482
15.3.4 Probabilistic Clustering p.483
15.3.4.1 Markovian Similarity-based Algorithm: CLUSEQ p.483
15.3.4.2 Mixture of Hidden Markov Models p.486
15.4 Outlier Detection in Sequences p.487
15.4.1 Position Outliers p.487
15.4.1.1 Efficiency Issues: Probabilistic Suffix Trees p.490
15.4.2 Combination Outliers p.491
15.4.2.1 Distance-based Models p.492
15.4.2.2 Frequency-based Models p.493
15.5 Hidden Markov Models p.494
15.5.1 Formal Definition and Techniques for HMMs p.496
15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence p.497
15.5.3 Explanation: Determining the Most Likely State Sequence for Observed Sequence p.498
15.5.4 Training: Baum-Welch Algorithm p.499
15.5.5 Applications p.500
15.6 Sequence Classification p.501
15.6.1 Nearest Neighbor Classifier p.501
15.6.2 Graph-based Methods p.501
15.6.3 Rule-based Methods p.502
15.6.4 Kernel Support Vector Machines p.503
15.6.4.1 Bag-of-Words Kernel p.503
15.6.4.2 Spectrum Kernel p.503
15.6.4.3 Weighted Degree Kernel p.504
15.6.5 Probabilistic Methods: Hidden Markov Models p.504
15.7 Summary p.505
15.8 Bibliographic Notes p.505
15.9 Exercises p.506
```

### 16 Mining Spatial Data

```16.1 Introduction p.509
16.2 Mining with Contextual Spatial Attributes p.510
16.2.1 Shape to Time-Series Transformation p.512
16.2.2 Spatial to Multidimensional Transformation with Wavelets p.515
16.2.3 Spatial Co-location Patterns p.516
16.2.4 Clustering Shapes p.517
16.2.5 Outlier Detection p.518
16.2.5.1 Point Outliers p.518
16.2.5.2 Shape Outliers p.520
16.2.6 Classification of Shapes p.521
16.3 Trajectory Mining p.522
16.3.1 Equivalence of Trajectories and Multivariate Time Series p.522
16.3.2 Converting Trajectories to Multidimensional Data p.523
16.3.3 Trajectory Pattern Mining p.524
16.3.3.1 Frequent Trajectory Paths p.524
16.3.3.2 Co-location Patterns p.526
16.3.4 Trajectory Clustering p.526
16.3.4.1 Computing Similarity between Trajectories p.526
16.3.4.2 Similarity-based Clustering Methods p.527
16.3.4.3 Trajectory Clustering as a Sequence Clustering Problem p.528
16.3.5 Trajectory Outlier Detection p.528
16.3.5.1 Distance-based Methods p.529
16.3.5.2 Sequence-based Methods p.529
16.3.6 Trajectory Classification p.530
16.3.6.1 Distance-based Methods p.530
16.3.6.2 Sequence-based Methods p.530
16.4 Summary p.531
16.5 Bibliographic Notes p.531
16.6 Exercises p.532
```

### 17 Mining Graph Data

```17.1 Introduction p.533
17.2 Matching and Distance Computation in Graphs p.535
17.2.1 Ullman’s Algorithm for Subgraph Isomorphism p.537
17.2.1.1 Algorithm Variations and Refinements p.539
17.2.2 Maximum Common Subgraph Problem p.540
17.2.3 Graph Matching Methods for Distance Computation p.541
17.2.3.1 Maximum Common Subgraph-based Distances p.541
17.2.3.2 Graph Edit Distance p.542
17.3 Transformation-based Distance Computation p.546
17.3.1 Frequent Substructure-based Transformation and Distance Computation p.546
17.3.2 Topological Descriptors p.547
17.3.3 Kernel-based Transformations and Computation p.549
17.3.3.1 Random-Walk Kernels p.549
17.3.3.2 Shortest-Path Kernels p.550
17.4 Frequent Substructure Mining in Graphs p.550
17.4.1 Node-based Join Growth p.552
17.4.2 Edge-based Join Growth p.554
17.4.3 Frequent Pattern Mining to Graph Pattern Mining p.554
17.5 Graph Clustering p.554
17.5.1 Distance-based Methods p.555
17.5.2 Frequent Substructure-based Methods p.555
17.5.2.1 Generic Transformational Approach p.556
17.5.2.2 XProj: Direct Clustering with Frequent Subgraph Discovery p.556
17.6 Graph Classification p.558
17.6.1 Distance-based Methods p.558
17.6.2 Frequent Substructure-based Methods p.558
17.6.2.1 Generic Transformational Approach p.559
17.6.2.2 XRules: A Rule-based Approach p.559
17.6.3 Kernel Support Vector Machines p.560
17.7 Summary p.560
17.8 Bibliographic Notes p.561
17.9 Exercises p.562
```

### 18 Mining Web Data

```18.1 Introduction p.565
18.2 Web Crawling and Resource Discovery p.567
18.2.1 A Basic Crawler Algorithm p.567
18.2.2 Preferential Crawlers p.569
18.2.4 Combatting Spider Traps p.569
18.2.5 Shingling for Near Duplicate Detection p.570
18.3 Search Engine Indexing and Query Processing p.570
18.4 Ranking Algorithms p.573
18.4.1 PageRank p.573
18.4.1.1 Topic-Sensitive PageRank p.576
18.4.1.2 SimRank p.577
18.4.2 HITS p.578
18.5 Recommender Systems p.579
18.5.1 Content-based Recommendations p.581
18.5.2 Neighborhood-based Methods for Collaborative Filtering p.582
18.5.2.1 User-based Similarity with Ratings p.582
18.5.2.2 Item-based Similarity with Ratings p.583
18.5.3 Graph-based Methods p.583
18.5.4 Clustering Methods p.585
18.5.5 Latent Factor Models p.586
18.5.5.1 Singular Value Decomposition p.587
18.5.5.2 Matrix Factorization p.587
18.6 Web Usage Mining p.588
18.6.1 Data Preprocessing p.589
18.6.2 Applications p.589
18.7 Summary p.590
18.8 Bibliographic Notes p.591
18.9 Exercises p.591
```

### 19 Social Network Analysis

```19.1 Introduction p.593
19.2 Social Networks: Preliminaries and Properties p.594
19.2.1 Homophily p.595
19.2.2 Triadic Closure and Clustering Coefficient p.595
19.2.3 Dynamics of Network Formation p.595
19.2.4 Power-Law Degree Distributions p.597
19.2.5 Measures of Centrality and Prestige p.597
19.2.5.1 Degree Centrality and Prestige p.597
19.2.5.2 Closeness Centrality and Proximity Prestige p.598
19.2.5.3 Betweenness Centrality p.600
19.2.5.4 Rank Centrality and Prestige p.600
19.3 Community Detection p.601
19.3.1 Kernighan-Lin Algorithm p.602
19.3.1.1 Speeding up Kernighan-Lin p.604
19.3.2 Girvan-Newman Algorithm p.604
19.3.3 Multilevel Graph Partitioning: METIS p.607
19.3.4 Spectral Clustering p.610
19.3.4.1 Important Observations and Intuitions p.613
19.4 Collective Classification p.614
19.4.1 Iterative Classification Algorithm p.615
19.4.2 Label Propagation with Random Walks p.616
19.4.2.1 Iterative Label Propagation: The Spectral Interpretation p.619
19.4.3 Supervised Spectral Methods p.619
19.4.3.1 Supervised Feature Generation with Spectral Embedding p.620
19.4.3.2 Graph Regularization Approach p.620
19.4.3.3 Connections with Random-Walk Methods p.622
19.5.1 Neighborhood-based Measures p.623
19.5.2 Katz Measure p.624
19.5.3 Random Walk-based Measures p.625
19.5.4 Link Prediction as a Classification Problem p.626
19.5.5 Link Prediction as a Missing Value Estimation Problem p.627
19.5.6 Discussion p.627
19.6 Social Influence Analysis p.627
19.6.1 Linear Threshold Model p.629
19.6.3 Influence Function Evaluation p.630
19.7 Summary p.630
19.8 Bibliographic Notes p.631
19.9 Exercises p.632
```

### 20 Privacy-Preserving Data Mining

```20.1 Introduction p.635
20.2 Privacy during Data Collection p.636
20.2.1 Reconstructing Aggregate Distributions p.637
20.2.2 Leveraging Aggregate Distributions for Data Mining p.639
20.3 Privacy-Preserving Data Publishing p.639
20.3.1 The k-anonymity Model p.641
20.3.1.1 Samarati’s Algorithm p.645
20.3.1.2 Incognito p.646
20.3.1.3 Mondrian Multidimensional k-Anonymity p.649
20.3.1.4 Synthetic Data Generation: Condensation-based Approach p.651
20.3.2 The ℓ-diversity Model p.653
20.3.3 The t-closeness Model p.655
20.3.4 The Curse of Dimensionality p.658
20.4 Output Privacy p.658
20.5 Distributed Privacy p.659
20.6 Summary p.661
20.7 Bibliographic Notes p.661
20.8 Exercises p.663
```

## References

Charu C. Aggarwal. *Data Mining: The Textbook*. Springer, 2015.