1998 DistributionalClusteringofWords

(Redirected from Baker & McCallum, 1998)

Subject Headings: Supervised Document Classification, Supervised Dimensionality Reduction.




This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy.

Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy - significantly better than Latent Semantic Indexing [6], class-based clustering [1], feature selection by mutual information [23] or Markov-blanket-based feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.
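The clustering idea in the abstract (grouping words by the distribution of class labels associated with each word) can be illustrated with a minimal sketch. The abstract does not give the similarity measure or the merge procedure, so the weighted KL-divergence-to-the-mean criterion, the greedy pairwise merging, the add-one smoothing, the toy documents, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def class_distributions(docs, classes):
    """Estimate P(class | word) from (tokens, label) pairs with add-one smoothing."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    dists, totals = {}, {}
    for word, c in counts.items():
        total = sum(c.values())
        totals[word] = total
        dists[word] = [(c[k] + 1) / (total + len(classes)) for k in classes]
    return dists, totals


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def divergence_to_mean(p, q, wp, wq):
    """Weighted average of KL(p || m) and KL(q || m), with m the weighted mean (assumed criterion)."""
    a = wp / (wp + wq)
    m = [a * pi + (1 - a) * qi for pi, qi in zip(p, q)]
    return a * kl(p, m) + (1 - a) * kl(q, m)


def cluster_words(docs, classes, n_clusters):
    """Greedily merge the two closest word clusters until n_clusters remain."""
    dists, totals = class_distributions(docs, classes)
    # Each cluster is (member words, class distribution, total word count).
    clusters = [([w], dists[w], totals[w]) for w in sorted(dists)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = divergence_to_mean(clusters[i][1], clusters[j][1],
                                       clusters[i][2], clusters[j][2])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        wi, pi, ni = clusters[i]
        wj, pj, nj = clusters[j]
        a = ni / (ni + nj)
        merged = (wi + wj, [a * x + (1 - a) * y for x, y in zip(pi, pj)], ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(words) for words, _, _ in clusters]


# Hypothetical toy corpus: two classes, four short documents.
docs = [
    (["goal", "team", "match"], "sports"),
    (["team", "goal", "coach"], "sports"),
    (["vote", "senate", "bill"], "politics"),
    (["senate", "vote", "law"], "politics"),
]
classes = ["sports", "politics"]
clusters = cluster_words(docs, classes, 2)
```

On this toy corpus the words that occur only in sports documents end up in one cluster and the words that occur only in politics documents in the other. Each cluster can then replace its member words as a single feature, which is how the feature space shrinks while the class-label information is kept.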



Author: L. Douglas Baker, Andrew Kachites McCallum
Title: Distributional Clustering of Words for Text Classification
DOI: 10.1145/290941.290970
Year: 1998