1980 AnAlgorithmForSuffixStripping

From GM-RKB

Subject Headings: Word Stem, Word Stemming Task, Word Stemming Algorithm, Porter Stemmer.

Quotes

1. Introduction

Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

       CONNECT
       CONNECTED
       CONNECTING
       CONNECTION
       CONNECTIONS

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT. In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.
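
The conflation described above can be sketched in a few lines of Python. This is only a naive illustration, assuming a fixed list of the four suffixes named in the text and a longest-match rule; it is not the algorithm the paper goes on to define, and the names `SUFFIXES`, `strip_suffix`, and `conflate` are hypothetical.

```python
# Illustrative only: a naive longest-suffix stripper, NOT the Porter algorithm.
# The suffix list is an assumption taken from the four suffixes named in the text.
SUFFIXES = ("IONS", "ING", "ION", "ED")


def strip_suffix(term: str) -> str:
    """Remove the longest matching suffix from SUFFIXES, if there is one."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require at least one character to remain after stripping.
        if term.endswith(suffix) and len(term) > len(suffix):
            return term[: -len(suffix)]
    return term


def conflate(terms):
    """Return the set of distinct stems produced by stripping each term."""
    return {strip_suffix(t) for t in terms}


terms = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
stems = conflate(terms)
print(stems)  # the five terms collapse to the single stem CONNECT
```

Here five index terms reduce to one, which is the reduction in vocabulary size the paragraph refers to. The same rule also turns RED into R, so an unconditional suffix list over-strips short words; a practical stemmer needs conditions on when a suffix may be removed.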

The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done.

In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.



Author: Martin F. Porter
Title: An Algorithm for Suffix Stripping
Year: 1980
URL: http://tartarus.org/~martin/PorterStemmer/def.txt