Word Embedding-based Lemmatizer

A Word Embedding-based Lemmatizer is a lemmatization system that uses word embeddings to solve the lemmatization task.

  • Context:
    • It can leverage machine learning algorithms, such as Random Forest classifiers or neural networks, to learn from large datasets of words and their lemma forms.
    • It can utilize various types of word embeddings, including those generated by models such as FastText or Word2Vec, to capture the nuanced meanings of words based on their usage in text (an illustrative sketch follows the See list below).
    • It can be particularly effective for low-resource languages that have few annotated corpora or natural language processing tools.
    • It can outperform traditional lemmatizers in terms of accuracy, especially in languages with complex morphology or when dealing with domain-specific vocabulary.
    • ...
  • Example(s):
    • The lemmatizer developed by Iskander Akhmetov et al., 2020, which was evaluated against the UDPipe lemmatizer on twenty-two of the twenty-five languages it covers and showed good performance across diverse language groups.
    • Character-level sequence-to-sequence lemmatization models that incorporate subword entities, as explored by Nasser Zalmout & Nizar Habash, 2020, showing the potential of using subword information to enhance lemmatization accuracy.
    • ...
  • Counter-Example(s):
    • A Rule-Based Lemmatizer, which relies on a set of predefined grammatical rules rather than learning from data.
    • A Dictionary-Based Lemmatizer, which looks up words in a precompiled list of words and their lemmas without considering the context in which the word is used.
  • See: Natural Language Processing, Word Embeddings, Machine Learning, Neural Networks, Subword Information.
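
The following is an illustrative, hypothetical Python sketch of this general recipe (toy data, a simplified rule encoding, and arbitrary hyperparameters; not any specific published system): FastText subword embeddings of each word form are fed to a Random Forest classifier that predicts a suffix-rewrite rule mapping the form to its lemma.

    from gensim.models import FastText
    from sklearn.ensemble import RandomForestClassifier

    # Toy (word form, lemma) pairs; a real system would train on a large
    # annotated lexicon or treebank.
    pairs = [("running", "run"), ("walked", "walk"), ("cats", "cat"),
             ("studies", "study"), ("played", "play"), ("dogs", "dog")]

    def suffix_rule(word, lemma):
        """Encode the form-to-lemma change as (chars to drop, string to append)."""
        i = 0
        while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
            i += 1
        return (len(word) - i, lemma[i:])        # ("running", "run") -> (4, "")

    # 1. Train (or load) subword-aware word embeddings over the word forms.
    forms = [[w for w, _ in pairs]]
    ft = FastText(vector_size=32, min_count=1, min_n=2, max_n=4)
    ft.build_vocab(corpus_iterable=forms)
    ft.train(corpus_iterable=forms, total_examples=len(forms), epochs=10)

    # 2. Fit a classifier from the word vector to the rewrite rule.
    rules = {str(suffix_rule(w, l)): suffix_rule(w, l) for w, l in pairs}
    X = [ft.wv[w] for w, _ in pairs]
    y = [str(suffix_rule(w, l)) for w, l in pairs]
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    def lemmatize(word):
        # FastText supplies a vector even for unseen forms via character n-grams.
        drop, append = rules[clf.predict([ft.wv[word]])[0]]
        return (word[:-drop] if drop else word) + append

    print(lemmatize("jumped"))   # "jump", if the classifier picks the -ed rule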


References

2020

  • (Akhmetov et al., 2020) ⇒ Iskander Akhmetov, Alexandr Pak, Irina Ualiyeva, and Alexander Gelbukh. (2020). "Highly Language-Independent Word Lemmatization Using a Machine-Learning Classifier.” In: Computación y Sistemas, 24(3), pages 1353-1364.
    • NOTE: It demonstrates the effectiveness of a machine-learning-based lemmatizer across multiple languages, highlighting its language independence and comparing its performance with UDPipe across diverse linguistic contexts.
    • ABSTRACT: Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.
    • Keywords: Lemmatization; natural language processing; text preprocessing; Random Forest classifier; Decision Tree classifier
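
A minimal sketch in the spirit of this reference (toy data and a hypothetical feature layout; the paper's actual feature construction is not reproduced here): each of the last few characters of the word form is treated as one categorical feature, and a Decision Tree classifier predicts the suffix rewrite that turns the form into its lemma, with no hand-coded rules.

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    K = 5  # number of trailing characters used as features
    pairs = [("corriendo", "correr"), ("hablando", "hablar"),
             ("gatos", "gato"), ("casas", "casa")]   # toy Spanish examples

    def tail(word):
        return list(("_" * K + word)[-K:])           # right-aligned, "_"-padded

    def rule(word, lemma):
        """Label the rewrite as '-<chars dropped>+<suffix appended>'."""
        i = min(len(word), len(lemma))
        while i > 0 and not lemma.startswith(word[:i]):
            i -= 1
        return f"-{len(word) - i}+{lemma[i:]}"       # "gatos" -> "gato" is "-1+"

    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X = enc.fit_transform([tail(w) for w, _ in pairs])
    y = [rule(w, l) for w, l in pairs]
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    print(clf.predict(enc.transform([tail("perros")])))   # ideally "-1+"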

2020

  • (Triapitsyn & Larin, 2020) ⇒ A. D. Triapitsyn, and A. I. Larin. (2020). "Designing of a Classifier for the Unstructured Text Formalization Model Based on Word Embedding.” In: 2020 International Conference on Engineering Management of Communication and Technology (EMCTECH), pages 1-5. IEEE.
    • NOTE: It highlights the importance of text preprocessing and word embedding in enhancing the efficiency of algorithms in processing unstructured text, pointing to the development of a novel classifier for text formalization.
    • ABSTRACT: The active use of artificial intelligence technologies has a direct positive impact on the development of society in various areas of human life. The article describes the developed model of processing and formalization of textual unstructured information in the form of a continuous flow of text information taken from the news feed of news agencies. A method for preprocessing text to reduce the execution time of the algorithm and save CPU resources is given. A method for representing words as a real vector is formed using various algorithms for training artificial neural networks and their properties. A model of the first stage of the text information processing system as a subsystem for classifying the subject of a news article text based on a vector representation of words, including a description of the word vectorization algorithm, an example of the type of word structure with a corresponding numeric vector, and a metric that determines the proximity of vectors to each other in space. The results of the experiment are obtained and a method for setting a decision criterion for the implemented classifier is proposed. The area of use of the proposed classifier is the sphere of information security. The results of the experiment can be indicators of the suitability of using the classifier as a definition of the subject of a news article.
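
A minimal sketch of the vector-proximity classification idea described in this abstract (hypothetical topic centroids, dimensionality, and threshold; not the authors' model): a document is represented by the mean of its word vectors and assigned to the nearest topic centroid by cosine similarity, with a rejection threshold serving as the decision criterion.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def classify(doc_vectors, centroids, threshold=0.3):
        doc = np.mean(doc_vectors, axis=0)                  # document embedding
        scores = {label: cosine(doc, c) for label, c in centroids.items()}
        label, best = max(scores.items(), key=lambda kv: kv[1])
        return label if best >= threshold else "unclassified"

    # Toy usage with random stand-ins for word vectors and topic centroids.
    rng = np.random.default_rng(0)
    centroids = {"security": rng.normal(size=50), "sports": rng.normal(size=50)}
    doc_vectors = [centroids["security"] + rng.normal(scale=0.1, size=50)
                   for _ in range(5)]
    print(classify(doc_vectors, centroids))                 # -> "security"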

2020

  • (Zalmout & Habash, 2020) ⇒ Nasser Zalmout, and Nizar Habash. (2020). "Utilizing Subword Entities in Character-Level Sequence-to-Sequence Lemmatization Models.” In: Proceedings of the 28th International Conference on Computational Linguistics, pages 4676-4682.
    • NOTE: It explores the enhancement of lemmatization accuracy through the integration of subword entities and pretrained embeddings in sequence-to-sequence models, showcasing advancements in character-level lemmatization techniques.
    • ABSTRACT: In this paper we present a character-level sequence-to-sequence lemmatization model, utilizing several subword features in multiple configurations. In addition to generic n-gram embeddings (using FastText), we experiment with concatenative (stems) and templatic (roots and patterns) morphological subwords. We present several architectures that embed these features directly at the encoder side, or learn them jointly at the decoder side with a multitask learning architecture. The results indicate that using the generic n-gram embeddings (through FastText) outperform the other linguistically-driven subwords. We use Modern Standard Arabic and Egyptian Arabic as test cases, with up to 22% and 13% relative error reduction, respectively, from a strong baseline. An error analysis shows that our best system is even able to handle word/lemma pairs that are both unseen in the training data.
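
A minimal PyTorch sketch of the character-level encoder-decoder shape described in this abstract, with a pretrained subword (FastText-style) word vector concatenated to each character embedding on the encoder side; the dimensions, the single-layer GRUs, and the absence of attention and multitask components are simplifications, not the authors' architecture.

    import torch
    import torch.nn as nn

    class CharSeq2SeqLemmatizer(nn.Module):
        def __init__(self, n_chars, char_dim=64, word_dim=300, hidden=256):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim)
            # Encoder consumes [char embedding ; word-level subword vector].
            self.encoder = nn.GRU(char_dim + word_dim, hidden, batch_first=True)
            self.decoder = nn.GRU(char_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_chars)

        def forward(self, form_chars, word_vec, lemma_chars):
            # form_chars:  (B, T_src) character ids of the inflected form
            # word_vec:    (B, word_dim) e.g. a FastText vector of the whole form
            # lemma_chars: (B, T_tgt) character ids of the lemma (teacher forcing)
            src = self.char_emb(form_chars)                          # (B, T_src, C)
            wv = word_vec.unsqueeze(1).expand(-1, src.size(1), -1)   # (B, T_src, W)
            _, h = self.encoder(torch.cat([src, wv], dim=-1))        # h: (1, B, H)
            tgt = self.char_emb(lemma_chars)                         # (B, T_tgt, C)
            dec_out, _ = self.decoder(tgt, h)                        # (B, T_tgt, H)
            return self.out(dec_out)                                 # char logits

    # Toy shapes only: batch of 2, source length 7, target length 5.
    model = CharSeq2SeqLemmatizer(n_chars=40)
    logits = model(torch.randint(40, (2, 7)), torch.randn(2, 300),
                   torch.randint(40, (2, 5)))
    print(logits.shape)   # torch.Size([2, 5, 40])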