2019 SpellingCorrectionAsaForeignLan

From GM-RKB

Subject Headings: Spelling Error Correction System; Encoder-Decoder Neural Network.

Notes

Cited By

Quotes

Author Keywords

Abstract

In this paper, we reformulated the spelling correction problem as a machine translation task under the encoder-decoder framework. This reformulation enabled us to use a single model for a problem that is traditionally formulated as learning a language model and an error model. The model employs multi-layer recurrent neural networks as an encoder and a decoder. We demonstrate the effectiveness of this model using an internal dataset, where the training data is automatically obtained from user logs. The model offers competitive performance compared to state-of-the-art methods but requires neither feature engineering nor hand tuning between models.

1 Introduction

2 Related Work

3 Background And Preliminaries

Figure 1: Encoder-decoder framework with attention, used for spelling correction. The encoder is a multi-layer recurrent neural network whose first layer is a bidirectional recurrent neural network. The attention model produces a context vector $C_i$ based on all encoding hidden states $h_i$ and the previous decoding state $s_{i-1}$. The decoder is a multi-layer recurrent neural network, and the decoding output $Y_i$ depends on both the context vector $C_i$ and the previous inputs $y_1, \cdots, y_{i-1}$.
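The attention step in the caption can be illustrated with a minimal sketch. The caption does not say how encoder states are scored against the previous decoder state, so the dot-product score below is an assumption, as are the function names and the toy two-dimensional states; the sketch also assumes encoder and decoder states share a dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encoder_states, prev_decoder_state):
    """Return (context, weights) for one decoding step.

    Assumption: dot-product scoring between each encoder hidden state
    and the previous decoder state. The source does not name the score
    function, so this is illustrative only.
    """
    scores = [
        sum(h_k * s_k for h_k, s_k in zip(h, prev_decoder_state))
        for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return context, weights


# Toy example: three encoder states, a zero decoder state (uniform attention).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = context_vector(encoder_states, [0.0, 0.0])
```

With a zero decoder state every score is equal, so the weights are uniform and the context vector is simply the mean of the encoder states.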

4 Spelling Correction As A Foreign Language

5 Experiments

We test our model on the task of correcting e-commerce queries. Unlike machine translation, there are no public datasets for e-commerce spelling correction, so we collect both training and evaluation data internally. For training data, we use the event logs that track user behavior on an e-commerce website. Our heuristic for finding potential spelling-related queries is based on consecutive user actions within one search session. The hypothesis is that users keep modifying the search query until the search results match their search intent, and from this sequence of query reformulations we can potentially extract pairs of misspelled and correctly spelled queries. Obviously, such sequences contain many kinds of query activity besides spelling corrections, so additional filtering is required to obtain representative data for spelling correction. We use the same techniques as Hasan et al. (2015). Filtering multiple months of data from our data warehouse yielded about 70 million misspelling and correction pairs as our training data. For testing, we use the same dataset as Hasan et al. (2015), which contains 4602 queries labeled by humans.
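The session heuristic described above can be sketched as follows. The text does not spell out the filtering taken from Hasan et al. (2015), so the edit-distance threshold used here is a stand-in assumption, and the function names, the `max_distance` parameter, and the sample session are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(session_queries, max_distance=2):
    """Collect (earlier, later) query pairs from consecutive reformulations.

    Assumption: a small, non-zero edit distance is used as the filter for
    "likely spelling correction"; the actual filtering is not given here.
    """
    pairs = []
    for earlier, later in zip(session_queries, session_queries[1:]):
        if 0 < edit_distance(earlier, later) <= max_distance:
            pairs.append((earlier, later))
    return pairs


# Hypothetical search session: one typo fix, one refinement, one new intent.
session = ["iphone chargr", "iphone charger",
           "iphone charger cable", "samsung case"]
pairs = candidate_pairs(session)
```

In this toy session only the first reformulation survives the filter; the later ones change the query too much to look like spelling fixes, which mirrors why additional filtering is needed on raw session data.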

We use beam search to obtain the final result from the model. The results are shown in Table 1. Although much simpler, our RNN-based model offers competitive performance compared to the previous methods. It is interesting to note that the BPE-based encoder and decoder perform the best. The better performance may be attributable to the shorter resulting sequences compared to the character case, and possibly to more semantically meaningful segments from the sub-words compared to the characters. Surprisingly, the character-based decoder performs quite well considering the complexity of the learning task. This demonstrates the benefit of end-to-end training and the robustness of the framework.
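For readers unfamiliar with beam search, here is a minimal generic sketch. It is not the paper's decoder: the `step_fn` interface, the `TOY` probability table, and the `</s>` end marker are assumptions made only to keep the example self-contained.

```python
import math


def beam_search(step_fn, beam_width=2, max_len=5, eos="</s>"):
    """Return the (tokens, log_prob) hypothesis with the highest score.

    step_fn(prefix) must return a dict mapping each possible next token
    to its log-probability given the prefix (a tuple of tokens).
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


# Hypothetical toy model where the greedy first choice is not the best path.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "x": 0.1},
}


def toy_step(prefix):
    probs = TOY.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


best_tokens, best_score = beam_search(toy_step, beam_width=2)
greedy_tokens, greedy_score = beam_search(toy_step, beam_width=1)
```

In the toy table, a beam of width 1 commits to "a" and ends with probability 0.30, while a beam of width 2 keeps "b" alive and finds the path with probability 0.36. This is the usual motivation for decoding with a beam rather than greedily.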

Method              Accuracy
Hasan et al. [8]    62.0%
C-2-W RNN           59.9%
W-2-W RNN           62.5%
C-2-C RNN           55.1%
Table 1: Results on the test dataset with various methods. C-2-C denotes that the model uses a character-based encoder and decoder; W-2-W denotes that the model uses a BPE partial-word-based encoder and decoder; and C-2-W denotes that the model uses a character-based encoder and a BPE partial-word-based decoder.
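The difference between the character and BPE settings comes down to how a query is segmented. The sketch below applies a byte-pair-encoding merge list to a single word; the merge list is entirely hypothetical (a real one would be learned from the training queries), and `apply_bpe` is an illustrative helper, not the authors' tokenizer.

```python
def apply_bpe(word, merges):
    """Segment a word by repeatedly applying the highest-priority merge.

    merges is an ordered list of symbol pairs; earlier pairs have
    higher priority, as in standard BPE inference.
    """
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, for illustration only.
MERGES = [("c", "h"), ("e", "r"), ("ch", "a"), ("g", "er"), ("cha", "r")]

char_tokens = list("charger")               # character-level input
bpe_tokens = apply_bpe("charger", MERGES)   # sub-word input
```

Here the character view has seven symbols while the sub-word view has two (`char` and `ger`), which illustrates the shorter sequences that the text suggests may help the W-2-W model.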

6 Conclusion

In this paper, we reformulated the spelling correction problem as a machine translation task under the encoder-decoder framework. The reformulation allowed us to solve the problem with a single model that can be trained end-to-end. We demonstrated the effectiveness of this model using an internal dataset, where the training data is automatically obtained from user logs. Despite its simplicity, the model performed competitively compared to the state-of-the-art methods, which require a lot of feature engineering and human intervention.



References

2019a

2019b

2016a

  • (Eger et al., 2016) ⇒ Steffen Eger, Tim vor der Bruck, and Alexander Mehler (2016). "A Comparison Of Four Character-Level String-To-String Translation Models For (OCR) Spelling Error Correction". The Prague Bulletin of Mathematical Linguistics 105, 1 (2016), 77–99.

2016b

2015a

2015b

2014

2013

  • (Raaijmakers, 2013) ⇒ Stephan Raaijmakers (2013). "A Deep Graphical Model For Spelling Correction".

2012

  • (Li et al., 2012) ⇒ Yanen Li, Huizhong Duan, and ChengXiang Zhai (2012). "CloudSpeller: Query Spelling Correction By Using A Unified Hidden Markov Model With Web-Scale Resources". In: Proceedings of the 21st International Conference on World Wide Web. ACM, 561–562.

2010

  • (Gao et al., 2010) ⇒ Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun (2010). "A Large Scale Ranker-Based System For Search Query Spelling Correction". In: Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 358–366.

2009

  • (Whitelaw et al., 2009) ⇒ Casey Whitelaw, Ben Hutchinson, Grace Y Chung, and Gerard Ellis (2009). "Using The Web For Language Independent Spellchecking And Autocorrection". In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2. Association for Computational Linguistics, 890–899.

2004

  • (Cucerzan & Brill, 2004) ⇒ Silviu Cucerzan and Eric Brill (2004). "Spelling Correction As An Iterative Process That Exploits The Collective Knowledge Of Web Users". In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 293–300.

2001

2000

  • (Brill & Moore, 2000) ⇒ Eric Brill and Robert C Moore (2000). "An Improved Error Model For Noisy Channel Spelling Correction". In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286–293.

1997

1994

1990

BibTeX

@inproceedings{DBLP:conf/sigir/ZhouPK19,
  author    = {Yingbo Zhou and
               Utkarsh Porwal and
               Roberto Konow},
  title     = {Spelling Correction as a Foreign Language},
  booktitle = {Proceedings of the {SIGIR} 2019 Workshop on eCommerce, co-located
               with the 42nd International {ACM} {SIGIR} Conference on Research and
               Development in Information Retrieval, eCom@SIGIR 2019, Paris, France,
               July 25, 2019},
  year      = {2019},
  crossref  = {DBLP:conf/sigir/2019ecom},
  url       = {http://ceur-ws.org/Vol-2410/paper28.pdf},
  timestamp = {Fri, 30 Aug 2019 13:15:06 +0200},
  biburl    = {https://dblp.org/rec/bib/conf/sigir/ZhouPK19},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}


Page: 2019 SpellingCorrectionAsaForeignLan
Author: Yingbo Zhou, Utkarsh Porwal, Roberto Konow
Title: Spelling Correction As a Foreign Language
Year: 2019