Word Stemming Task
- AKA: Suffix Stripping, Word Stem Identification, WSIT.
- WSIT(The) ⇒ the.
- WSIT(wolves, went) ⇒ wolv, went.
- WSIT(The, wolves, went, to, her, ex-sisters-in-law's, pseudocottage) ⇒
the, wolv, went, to, her, exsistersinlaw, s, pseudocottag.
- WSIT(The, ex-governor, general, 's, sisters-in-law, saw, the, wolves', den, near, Mr., Smith's, home, in, Sault, Ste., Marie.) ⇒
the, exgovernor, general, s, sistersinlaw, saw, the, wolv, den, near, mr, smith, s, home, in, sault, ste, mari.
- See: NLP Task.
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Stemming Retrieved:2015-4-11.
- Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
Stemming programs are commonly referred to as stemming algorithms or stemmers.
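The idea that related words only need to map to the same stem, even when that stem is not a valid root, can be illustrated with a minimal sketch. This is a toy suffix stripper and not the Porter or Snowball algorithm; the suffix list, the minimum stem length, and the names `toy_stem` and `conflate` are assumptions made only for this illustration.

```python
# Toy suffix-stripping stemmer. This is NOT the Porter algorithm; the suffix
# list and the minimum stem length are assumptions chosen for illustration.
SUFFIXES = ("ing", "ed", "es", "s")  # "es" is tried before "s"
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    token = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough characters remain to form a usable stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token


def conflate(words):
    """Group words by stem, so that words sharing a stem are treated alike."""
    groups = {}
    for word in words:
        groups.setdefault(toy_stem(word), []).append(word)
    return groups


if __name__ == "__main__":
    print(toy_stem("wolves"))  # wolv (a stem, not a valid English root)
    print(conflate(["connect", "connected", "connecting", "connects"]))
```

Real stemmers apply ordered rule sets with conditions on the remaining stem, which a flat suffix list like this one does not attempt to reproduce.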
- Porter Stemmer
- Demo: http://snowball.tartarus.org/demo.php
- We present stemming algorithms, and Snowball stemmers, for English, for Russian, for the Romance languages French, Spanish, Portuguese and Italian, for German and Dutch, for Swedish, Norwegian (bokmål dialect) and Danish, and for Finnish.
- Snowball, and most of the current stemming algorithms were written by Dr Martin Porter, who also prepared the material for the Website. The Snowball to Java codegenerator, and supporting Java libraries, were contributed by Richard Boulton. Dr Andrew Macfarlane, of City University, London, gave much initial encouragement and proofreading assistance.
- (Manning and Schütze, 1999) ⇒ Christopher D. Manning and Hinrich Schütze. (1999). “Foundations of Statistical Natural Language Processing.” The MIT Press.
- QUOTE: Extensive empirical research within the Information Retrieval (IR) community has shown that doing stemming does not help the performance of classic IR systems when performance is measured as an average over queries (Salton 1989; Hull 1996). There are always some queries for which stemming helps a lot. But there are others where performance goes down. This is a somewhat surprising result, especially from the viewpoint of linguistic intuition, and so it is important to understand why that is. There are three main reasons for this.
- (Porter, 1980) ⇒ Martin F. Porter. (1980). “An Algorithm for Suffix Stripping.” In: Program, 14(3):130–137.
- QUOTE: Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, ...
In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.
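The document representation described in the quote above, a vector of terms in which terms with a common stem are treated alike, can be sketched as follows. The helper `strip_plural_s` is a hypothetical stand-in for a real stemmer, and `term_vector` and `overlap` are names invented for this sketch; none of them come from Porter's paper.

```python
from collections import Counter
from typing import Callable, Iterable


def strip_plural_s(term: str) -> str:
    """Hypothetical stand-in for a stemmer: lowercase, drop one trailing 's'."""
    term = term.lower()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def term_vector(words: Iterable[str],
                stem: Callable[[str], str] = strip_plural_s) -> Counter:
    """Represent a document (or query) as counts of its normalized terms."""
    return Counter(stem(word) for word in words)


def overlap(query_vec: Counter, doc_vec: Counter) -> int:
    """Count the distinct terms that the query and the document share."""
    return len(query_vec.keys() & doc_vec.keys())


if __name__ == "__main__":
    title = "Stemming algorithms for retrieval systems".split()
    query = ["retrieval", "algorithm"]

    # Lowercasing only: "algorithm" does not match "algorithms".
    print(overlap(term_vector(query, str.lower), term_vector(title, str.lower)))  # 1

    # With suffix stripping, both query terms match the title.
    print(overlap(term_vector(query), term_vector(title)))  # 2
```

Passing the stemmer in as a parameter keeps the term-vector logic independent of any particular suffix-stripping rules, so a real stemmer could be substituted without changing the rest of the sketch.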