2015 APrimeronNeuralNetworkModelsfor

Jump to: navigation, search

Subject Headings: Neural Network-based NLP Algorithm.


Cited By



Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.

1. Introduction

For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors.

Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs. While most of the neural network techniques are easy to apply, sometimes as almost drop-in replacements of the old linear classifiers, there is in many cases a strong barrier of entry. In this tutorial I attempt to provide NLP practitioners (as well as newcomers) with the basic background, jargon, tools and methodology that will allow them to understand the principles behind the neural network models and apply them to their own work. This tutorial is expected to be self-contained, while presenting the different approaches under a unified notation and framework. It repeats a lot of material which is available elsewhere. It also points to external sources for more advanced topics when appropriate.

This primer is not intended as a comprehensive resource for those that will go on and develop the next advances in neural-network machinery (though it may serve as a good entry point). Rather, it is aimed at those readers who are interested in taking the existing, useful technology and applying it in useful and creative ways to their favourite NLP problems. For more in-depth, general discussion of neural networks, the theory behind them, advanced optimization methods and other advanced topics, the reader is referred to other existing resources. In particular, the book by Bengio et al (2015) is highly recommended.


The focus is on applications of neural networks to language processing tasks. However, some subareas of language processing with neural networks were decidedly left out of scope of this tutorial. These include the vast literature of language modeling and acoustic modeling, the use of neural networks for machine translation, and multi-modal applications combining language and other signals such as images and videos (e.g. caption generation). Caching methods for efficient runtime performance, methods for efficient training with large output vocabularies and attention models are also not discussed. Word embeddings are discussed only to the extent that is needed to understand in order to use them as inputs for other models. Other unsupervised approaches, including autoencoders and recursive autoencoders, also fall out of scope. While some applications of neural networks for language modeling and machine translation are mentioned in the text, their treatment is by no means comprehensive.

A Note on Terminology

The word “feature” is used to refer to a concrete, linguistic input such as a word, a suffix, or a part-of-speech tag. For example, in a first-order partof - speech tagger, the features might be " current word, previous word, next word, previous part of speech”. The term “input vector” is used to refer to the actual input that is fed to the neural-network classifier. Similarly, “input vector entry” refers to a specific value of the input. This is in contrast to a lot of the neural networks literature in which the word “feature” is overloaded between the two uses, and is used primarily to refer to an input-vector entry.

Mathematical Notation

I use bold upper case letters to represent matrices [math](\mathbf{X}, \mathbf{X}, \mathbf{X})[/math], and bold lower-case letters to represent vectors (b). When there are series of related matrices and vectors (for example, where each matrix corresponds to a different layer in the network), superscript indices are used [math](W_1, W_2)[/math]. For the rare cases in which we want indicate the power of a matrix or a vector, a pair of brackets is added around the item to be exponentiated [math](\mathbf{W})^2, (\mathbf{W^3})^2[/math]. Unless otherwise stated, vectors are assumed to be row vectors. We use [math][\mathbf{v}_1,\mathbf{v}_2][/math] to denote vector concatenation.

2. Neural Network Architectures

Neural networks are powerful learning models. We will discuss two kinds of neural network architectures, that can be mixed and matched - feed-forward networks and Recurrent/Recursive networks. Feed-forward networks include networks with fully connected layers, such as the multi-layer perceptron, as well as networks with convolutional and pooling layers. All of the networks act as classifiers, but each with different strengths. Fully connected feed-forward neural networks (Section 4) are non-linear learners that can, for the most part, be used as a drop-in replacement wherever a linear learner is used. This includes binary and multiclass classification problems, as well as more complex structured prediction problems (Section 8). The non-linearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often lead to superior classification accuracy. A [[series of works (Chen & Manning, 2014; Weiss, Alberti, Collins, & Petrov, 2015; Pei, Ge, & Chang, 2015; Durrett & Klein, 2015) managed to obtain improved syntactic parsing results by simply replacing the linear model of a parser with a fully connected feed-forward network. Straight-forward applications of a feed-forward network as a classifier replacement (usually coupled with the use of pre-trained word vectors) provide benefits also for CCG supertagging (Lewis & Steedman, 2014), dialog state tracking (Henderson, Thomson, & Young, 2013), pre-ordering for statistical machine translation (de Gispert, Iglesias, & Byrne, 2015) and language modeling (Bengio, Ducharme, Vincent, & Janvin, 2003; Vaswani, Zhao, Fossum, & Chiang, 2013). Iyyer et al (2015) demonstrate that multilayer feed-forward networks can provide competitive results on sentiment classification and factoid question answering.

Networks with convolutional and pooling layers (Section 9) are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. For example, in a document classification task, a single key phrase (or an ngram) can help in determining the topic of the document (Johnson & Zhang, 2015). We would like to learn that certain sequences of words are good indicators of the topic, and do not necessarily care where they appear in the document. Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position. Convolutional and pooling architecture show promising results on many tasks, including document classification (Johnson & Zhang, 2015), short-text categorization (Wang, Xu, Xu, Liu, Zhang, Wang, & Hao, 2015a), sentiment classification (Kalchbrenner, Grefenstette, & Blunsom, 2014; Kim, 2014), relation type classification between entities (Zeng, Liu, Lai, Zhou, & Zhao, 2014; dos Santos, Xiang, & Zhou, 2015), event detection (Chen, Xu, Liu, Zeng, & Zhao, 2015; Nguyen & Grishman, 2015), paraphrase identification (Yin & Schutze, 2015) semantic role labeling (Collobert, Weston, Bottou, Karlen, Kavukcuoglu, & Kuksa, 2011), question answering (Dong, Wei, Zhou, & Xu, 2015), predicting box-office revenues of movies based on critic reviews (Bitvai & Cohn, 2015) modeling text interestingness (Gao, Pantel, Gamon, He, & Deng, 2014), and modeling the relation between character-sequences and part-of-speech tags (Santos & Zadrozny, 2014).

In natural language we often work with structured data of arbitrary sizes, such as sequences and trees. We would like to be able to capture regularities in such structures, or to model similarities between such structures. In many cases, this means encoding the structure as a fixed width vector, which we can then pass on to another statistical learner for further processing. While convolutional and pooling architectures allow us to encode arbitrary large items as fixed size vectors capturing their most salient features, they do so by sacrificing most of the structural information. Recurrent (Section 10) and recursive (Section 12) architectures, on the other hand, allow us to work with sequences and trees while preserving a lot of the structural information. Recurrent networks (Elman, 1990) are designed to model sequences, while recursive networks (Goller & Kuchler, 1996) are generalizations of recurrent networks that can handle trees. We will also discuss an extension of recurrent networks that allow them to model stacks (Dyer, Ballesteros, Ling, Matthews, & Smith, 2015; Watanabe & Sumita, 2015).

Recurrent models have been shown to produce very strong results for language modeling, including (Mikolov, Karafiat, Burget, Cernocky, & Khudanpur, 2010; Mikolov, Kombrink, Lukas Burget, Cernocky, & Khudanpur, 2011; Mikolov, 2012; Duh, Neubig, Sudoh, & Tsukada, 2013; Adel, Vu, & Schultz, 2013; Auli, Galley, Quirk, & Zweig, 2013; Auli & Gao, 2014); as well as for sequence tagging (Irsoy & Cardie, 2014; Xu, Auli, & Clark, 2015; Ling, Dyer, Black, Trancoso, Fermandez, Amir, Marujo, & Luis, 2015b), machine translation (Sundermeyer, Alkhouli, Wuebker, & Ney, 2014; Tamura, Watanabe, & Sumita, 2014; Sutskever, Vinyals, & Le, 2014; Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk, & Bengio, 2014b]]), dependency parsing (Dyer et al., 2015; Watanabe & Sumita, 2015), sentiment analysis (Wang, Liu, Sun, Wang, & Wang, 2015b), noisy text normalization (Chrupala, 2014), dialog state tracking (Mrksic, O Seaghdha, Thomson, Gasic, Su, Vandyke, Wen, & Young, 2015]]), response generation (Sordoni, Galley, Auli, Brockett, Ji, Mitchell, Nie, Gao, & Dolan, 2015), and modeling the relation between character sequences and part-of-speech tags (Ling et al., 2015b).

Recursive models were shown to produce state-of-the-art or near state-of-the-art results for constituency (Socher, Bauer, Manning, & Andrew Y., 2013) and dependency (Le & Zuidema, 2014; Zhu, Qiu, Chen, & Huang, 2015a) parse re-ranking, discourse parsing (Li, Li, & Hovy, 2014), semantic relation classification (Hashimoto, Miwa, Tsuruoka, & Chikayama, 2013; Liu, Wei, Li, Ji, Zhou, & Wang, 2015), political ideology detection based on parse trees (Iyyer, Enns, Boyd-Graber, & Resnik, 2014b), sentiment classification (Socher, Perelygin, Wu, Chuang, Manning, Ng, & Potts, 2013; Hermann & Blunsom, 2013), target-dependent sentiment classification (Dong, Wei, Tan, Tang, Zhou, & Xu, 2014) and question answering (Iyyer, Boyd-Graber, Claudino, Socher, & Daume III, 2014a).

3 Feature Representation

4 Feed-forward Neural Networks



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2015 APrimeronNeuralNetworkModelsforYoav GoldbergA Primer on Neural Network Models for Natural Language Processing2015
AuthorYoav Goldberg +
titleA Primer on Neural Network Models for Natural Language Processing +
year2015 +