Word Stemming System

From GM-RKB

A Word Stemming System is a text processing system that can solve a word stemming task, i.e. one that maps inflected or derived word forms to a common stem.
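As an illustration only, a very small stemming system can be sketched as a tokenizer followed by a suffix stripper. The suffix list, the minimum stem length, and the names `stem_word` and `stem_text` below are assumptions made for this sketch; they are not taken from any of the systems referenced on this page.

```python
import re
from typing import List

# Hypothetical suffix list, ordered longest first; real stemmers use far richer rules.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")
MIN_STEM_LEN = 3  # assumed guard against over-stripping short words


def stem_word(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LEN:
            return lowered[: -len(suffix)]
    return lowered


def stem_text(text: str) -> List[str]:
    """Tokenize on alphabetic runs and stem each token."""
    return [stem_word(token) for token in re.findall(r"[A-Za-z]+", text)]


print(stem_text("Connected connections keep connecting"))
# → ['connect', 'connection', 'keep', 'connect']
```

Note that this naive sketch leaves "connections" as "connection" rather than "connect"; rule-based algorithms such as Porter's apply several ordered steps to handle stacked suffixes.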



References

2018

2009a

  • (Snowball, 2009) ⇒ http://snowball.tartarus.org/texts/stemmersoverview.html
    • We present stemming algorithms, and Snowball stemmers, for English, for Russian, for the Romance languages French, Spanish, Portuguese and Italian, for German and Dutch, for Swedish, Norwegian (bokmål dialect) and Danish, and for Finnish.
    • Snowball, and most of the current stemming algorithms, were written by Dr Martin Porter, who also prepared the material for the Website. The Snowball to Java code generator, and supporting Java libraries, were contributed by Richard Boulton. Dr Andrew Macfarlane, of City University, London, gave much initial encouragement and proofreading assistance.

2009b

  • Purpose: Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.
  • Provenance: Written by B. Frakes and C. Cox, 1986.

2006

1980

  • (Porter, 1980) ⇒ Martin F. Porter. (1980). “An algorithm for suffix stripping” (PDF). In: Program, 14(3):130–137.
    • QUOTE: In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.

      Perhaps the best criterion for removing suffixes from two words W1 and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements ‘a document is about W1’ and ‘a document is about W2’. So if W1=‘CONNECTION’ and W2=‘CONNECTIONS’ it seems very reasonable to conflate them to a single stem. But if W1=‘RELATE’ and W2=‘RELATIVITY’ it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics (...)
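
The conflation criterion in the quote above can be illustrated with a short sketch that groups words by stem. It implements only the plural rules of step 1a of (Porter, 1980) (SSES → SS, IES → I, SS → SS, S → empty), not the full algorithm; the function names `step_1a` and `conflate` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, List


def step_1a(word: str) -> str:
    """Apply only the step 1a plural rules: SSES->SS, IES->I, SS->SS, S->''."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]  # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]  # ponies -> poni
    if w.endswith("ss"):
        return w       # caress -> caress
    if w.endswith("s"):
        return w[:-1]  # cats -> cat
    return w


def conflate(words: List[str]) -> Dict[str, List[str]]:
    """Group words into conflation classes keyed by their stem."""
    groups = defaultdict(list)
    for word in words:
        groups[step_1a(word)].append(word)
    return dict(groups)


classes = conflate(["CONNECTION", "CONNECTIONS", "RELATE", "RELATIVITY"])
for stem, members in classes.items():
    print(stem, members)
```

Under these rules CONNECTION and CONNECTIONS fall into one class, while RELATE and RELATIVITY remain separate, which matches the behaviour the quote describes as reasonable.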