1995 NoiseReductInStatisticApproachToTextCategorization


Subject Headings: Document Clustering; Ontology; Comparison Study.


Cited By



This paper studies noise reduction for computational efficiency improvements in a [[statistical learning method for text categorization]], the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of “non-informative words” from texts before training; the use of a truncated singular value decomposition to cut off noisy “latent semantic structures” during training; the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without losing categorization accuracy were evident in the testing results.
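The three strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the matrix orientation (words by documents for `A`, categories by documents for `B`), the toy data, the stoplist, the rank `k`, the pruning threshold, and the helper names `llsf_fit` and `prune` are all assumptions made for illustration.

```python
import numpy as np

# Toy vocabulary and a hypothetical stoplist (both invented for illustration).
vocab = ["the", "of", "heart", "attack", "lung", "cancer"]
stoplist = {"the", "of"}

# A: word-by-document counts (rows follow vocab); B: category-by-document labels.
A = np.array([
    [1, 0, 1, 1],   # the
    [0, 1, 0, 1],   # of
    [1, 1, 0, 0],   # heart
    [1, 1, 0, 0],   # attack
    [0, 0, 1, 1],   # lung
    [0, 0, 1, 1],   # cancer
], dtype=float)
B = np.array([
    [1, 1, 0, 0],   # category 0 (e.g. cardiology)
    [0, 0, 1, 1],   # category 1 (e.g. oncology)
], dtype=float)

# Strategy 1 (before training): drop rows for words assumed non-informative.
keep = [i for i, w in enumerate(vocab) if w not in stoplist]
A_kept = A[keep, :]

def llsf_fit(A, B, k):
    """Least-squares W for W @ A ~ B, using a rank-k truncated SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Never keep more components than there are non-negligible singular values.
    k = min(k, int(np.sum(s > 1e-10 * s[0])))
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_pinv = Vt_k.T @ np.diag(1.0 / s_k) @ U_k.T   # documents x words
    return B @ A_pinv                              # categories x words

# Strategy 2 (during training): truncate the SVD to k components.
W = llsf_fit(A_kept, B, k=2)

def prune(W, threshold):
    """Zero out entries of W whose magnitude is below threshold."""
    return np.where(np.abs(W) < threshold, 0.0, W)

# Strategy 3 (after training): remove small entries of the solution matrix.
W_sparse = prune(W, threshold=1e-6)

# Categorize a new text containing "heart attack" (vector over the kept words).
query = np.array([1, 1, 0, 0], dtype=float)
scores = W_sparse @ query
```

Here `W` plays the role of the word-concept association matrix: each row holds one category's weights over the retained words, and `scores` ranks the categories for the new text. In this toy case the retained matrix has rank 2, so `k=2` loses nothing; on real collections the choice of `k` and of the pruning threshold trades accuracy against cost.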


  • Yiming Yang, Chute CG. (1992) A Linear Least Squares Fit mapping method for information retrieval from natural language texts. Proc 14th International Conference on Computational Linguistics (COLING 92), 447-453.
  • Yiming Yang, Chute CG. (1993, July) An application of Least Squares Fit Mapping to text information retrieval. Proc 16th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93), 281-290.
  • Yiming Yang, Chute CG. (1994) An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS 94): 253-277.
  • Fuhr N, Hartmann S, Lustig G, et al. (1991) AIR/X - a rule-based multistage indexing system for large subject fields. Proceedings of the RIAO’91, 606-623.
  • Tzeras K, Hartmann S. (1993) Automatic indexing based on Bayesian inference networks. Proc 16th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93), 22-34.
  • Masand B., Linoff G., Waltz D. (1992) Classifying News Stories using Memory Based Reasoning. 15th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 92): 59-64.
  • Yang Y. (1994) Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. 17th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94): 11-21.
  • Golub GH, Van Loan CF. (1989) Matrix Computations, 2nd Edition. Baltimore, MD: The Johns Hopkins University Press.
  • Dongarra JJ, Moler CB, Bunch JR, Stewart GW. (1979) LINPACK Users’ Guide. Philadelphia, PA: SIAM.
  • Cullum JK and Willoughby RA. (1985) Lanczos Algorithms for Large Symmetric Eigenvalue Computations. Vol.1: Theory. Boston: Birkhäuser.
  • Berry MW. (1992) Large-Scale Sparse Singular Value Computations. The International Journal of Supercomputer Applications Vol. 6 No.1: 13-49.
  • Lewis DD. (1991) Evaluating Text Categorization. Proceedings of the Speech and Natural Language Workshop, Asilomar. Morgan Kaufmann: 312-318.
  • Fox EA. (Ed.). (1990) Virginia Disc One. Virginia Polytechnic Institute and State University, Nimbus Records.
  • ACM Guide to Computing Literature. Baltimore, MD: Association for Computing Machinery, 1984: 657-658.
  • Buckley C, Salton G, Allan J. (1993) Automatic Retrieval With Locality Information Using SMART. In: DK Harman, Ed. The First Text REtrieval Conference (TREC-1): 59-65.
  • Wilbur JW, Sirotkin K. (1992) The automatic identification of stop words. J. Inf. Sci. 18:45–55.
  • Salton G. (1989) Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, Massachusetts: Addison-Wesley.
  • Yiming Yang, W.J. Wilbur. (1995) Using Corpus Statistics to Remove Redundant Words in Text Categorization, J Amer Soc Inf Sci (accepted).
  • Jackson JE. (1978) Review of “Methods for statistical data analysis of multivariate observations” Technometrics Vol 20:210-211.
  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. (1990) Indexing by Latent Semantic Analysis. J Amer Soc Inf Sci 41, 6, 391-407.
  • Chute CG, Yang Y. (1991) Latent Semantic Indexing of Medical Diagnoses using UMLS Semantic Structures. Proceedings of the Fifteenth Annual Symposium on Computer Applications in Medical Care (SCAMC 91):185-189.
  • Dumais S. (1994) Latent Semantic Indexing (LSI) and TREC-2. In: DK Harman, Ed. The Second Text REtrieval Conference (TREC-2): 105–116.

1995 NoiseReductInStatisticApproachToTextCategorization
Author: Yiming Yang
Title: Noise Reduction in a Statistical Approach to Text Categorization
Journal: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Title URL: http://nyc.lti.cs.cmu.edu/yiming/Publications/yang-sigir95.pdf
Year: 1995