1994 NGramBasedTextCategorization

Jump to: navigation, search

Subject Headings: Text Classification Algorithm, Character N-gram Model.



Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.

We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8 % correct classification rate on Usenet newsgroup articles written in different languages.

The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80 % correct classification rate. There are also several obvious directions for improving the system's classification performance in those cases where it did not do as well. The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document's profile and each of the

2. N-Grams

An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:

  • bi-grams: _T, TE, EX, XT, T_
  • tri-grams: _TE, TEX, EXT, XT_, T_ _
  • quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _

In general, a string of length [math]k[/math], padded with blanks, will have k+1 bi-grams, k+1tri-grams, k+1 quad-grams, and so on.

N-gram-based matching has had some success in dealing with noisy ASCII input in other problem domains, such as in interpreting postal addresses ([1] and [2]), in text retrieval ([3] and [4]), and in a wide variety of other natural language processing applications [5]. The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact. If we count N-grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of textual errors.


  • [1] Cavnar, William B. and Vayda, Alan J., “Using superimposed coding of N-gram lists for Efficient Inexact Matching”, Proceedings of the Fifth USPS Advanced Technology Conference, Washington D.C., 1992.
  • [2] Cavnar, William B. and Vayda, Alan J., “Ngram-based matching for multi-field database access in postal applications”, Proceedings of the 1993 Symposium On Document Analysis and Information Retrieval, University of Nevada, Las Vegas.
  • [3] Cavnar, William B., “N-Gram-based Text Filtering For TREC-2,” to appear in the proceedings of The Second Text REtrieval Conference (TREC-2), ed. by, Harman, D.K., NIST, Gaithersburg, Maryland, 1993.
  • [4] Kimbrell, R.E., “Searching for Text? Send and N-gram!,” Byte, May 1988, pp. 297-312.
  • [5] Suen, Ching Y., “N-Gram Statistics for Natural Language Understanding and Text Processing,” IEEE Trans. on Pattern Analysis and Machine Intelligence]], Vol. PAMI- 1, No. 2, April 1979, pp.164-172.
  • [6] Zipf, George K., Human Behavior and the Principle of Least Effort, an Introduction to Human Ecology, Addison-Wesley, Reading, Mass., 1949.,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1994 NGramBasedTextCategorizationWilliam B. Cavnar
John M. Trenkle
N-gram-based Text CategorizationProceedings of SDAIR-94http://citeseerx.ist.psu.edu/viewdoc/download?doi=