2004 TextInducedSpellingCorrection


Subject Headings: Text Error Correction (TEC) System; Text Induced Spelling Correction (TISC) System.

Notes

Cited By

Quotes

Abstract

We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigrams and word bigrams. It is stored in a novel representation based on a purpose-built hashing function, which provides a fast and computationally tractable way of checking whether a particular word form likely constitutes a spelling error and of retrieving correction candidates. When insufficient information is available for an unambiguous decision on a single correction, the system uses input context and lexicon evidence to automatically propose a limited number of ranked correction candidates. We describe the implemented prototype and evaluate it on English and Dutch text containing real-world errors in more or less limited contexts. The results are compared with those of the isolated-word spelling checking programs ISPELL and the Microsoft Proofing Tools (MPT).
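As a rough illustration of a lexicon of word unigrams and word bigrams derived from raw text without supervision, the following minimal sketch may help. It is not the TISC representation: the purpose-built hashing function is not described in this excerpt, so Python's built-in hash tables (`Counter`) stand in for it. The names `build_lexicon` and `is_suspect`, the tokenization rule, and the `min_count` threshold are all assumptions made for this example.

```python
import re
from collections import Counter


def build_lexicon(raw_text):
    """Collect word unigram and bigram counts from raw, unlabeled text.

    Assumption: tokens are lowercased runs of letters; a real system
    would need a more careful, language-aware tokenizer.
    """
    tokens = re.findall(r"[^\W\d_]+", raw_text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def is_suspect(word, unigrams, min_count=1):
    """Flag a word form as a likely non-word error if it occurs fewer than
    min_count times in the lexicon (hypothetical threshold)."""
    return unigrams[word.lower()] < min_count


corpus = "The cat sat on the mat. The cat slept."
unigrams, bigrams = build_lexicon(corpus)
```

With this toy corpus, `is_suspect("cta", unigrams)` flags the unseen form, while `is_suspect("cat", unigrams)` does not. The bigram counts could serve as context evidence when ranking candidates, although how TISC actually combines them is not stated here.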

1 Introduction

The automatic detection and correction of errors is an important problem in text recognition. Textual errors arise mainly during the recognition process and are known as edit errors: insert, delete, or change errors. In text recognition systems, error correction is provided in part by Contextual Postprocessing (CP). Let [math]\displaystyle{ w = a_1\;a_2 \cdots a_m }[/math] be an observed word obtained from a previous stage of the system, where the characters [math]\displaystyle{ a_i (1 \leq i \leq m) }[/math] belong to an alphabet [math]\displaystyle{ \Sigma }[/math]. The objective of the CP is to estimate a word [math]\displaystyle{ \hat{w} }[/math] in a set of words [math]\displaystyle{ D }[/math] (a dictionary) that is the best selection for [math]\displaystyle{ w }[/math]; for example, one that minimizes a certain distance function [math]\displaystyle{ d(\hat{w}, w) }[/math] or maximizes the a posteriori probability [math]\displaystyle{ P(\hat{w} | w) }[/math]. This problem is referred to as text error correction.
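The distance-minimizing formulation above can be sketched as follows. This is a minimal example, assuming the distance function d is the Levenshtein distance over insert, delete, and change operations with unit costs; the passage only says "a certain distance function", so this choice, along with the names `edit_distance` and `correct` and the alphabetical tie-breaking rule, is an assumption of the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insert, delete, or change operations turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # change ca to cb (or match)
        prev = curr
    return prev[-1]


def correct(w, dictionary):
    """Return the dictionary word closest to the observed word w.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    return min(dictionary, key=lambda cand: (edit_distance(cand, w), cand))


D = {"recognition", "recognize", "cognition"}
best = correct("recogniton", D)
```

Here `best` is `"recognition"`, which lies one insertion away from the observed word. This exhaustive scan costs one distance computation per dictionary entry, so it is only practical for small dictionaries.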

(...)

References


Martin Reynaert. (2004). "Text Induced Spelling Correction." doi:10.3115/1220355.1220475