2018 MemoryArchitecturesinRecurrentNeuralNetworkLanguageModels


Subject Headings:

Notes

Cited By

Quotes

Abstract

We compare and analyze sequential, random access, and stack memory architectures for recurrent neural network language models. Our experiments on the Penn Treebank and Wikitext-2 datasets show that stack-based memory architectures consistently achieve the best performance in terms of held out perplexity. We also propose a generalization to existing continuous stack models (Joulin & Mikolov, 2015; Grefenstette et al., 2015) to allow a variable number of pop operations more naturally that further improves performance. We further evaluate these language models in terms of their ability to capture non-local syntactic dependencies on a subject-verb agreement dataset (Linzen et al., 2016) and establish new state of the art results using memory augmented language models. Our results demonstrate the value of stack-structured memory for explaining the distribution of words in natural language, in line with linguistic theories claiming a context-free backbone for natural language.
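The continuous stack models referenced above (Joulin & Mikolov, 2015; Grefenstette et al., 2015) maintain a differentiable stack updated by soft push and pop operations. Below is a minimal sketch of a Joulin & Mikolov-style soft push/pop update, assuming a fixed-depth stack, a three-way {push, pop, no-op} action set, and illustrative weight names (W_action, W_push) that are not from the paper; the paper's proposed generalization to a variable number of pop operations is not shown here.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def soft_stack_update(stack, h_t, W_action, W_push):
    """One soft update of a continuous stack in the style of Joulin & Mikolov (2015).

    stack:    (depth, stack_dim)  current stack contents, top of stack at index 0
    h_t:      (hidden,)           current RNN hidden state
    W_action: (3, hidden)         illustrative weights scoring {push, pop, no-op}
    W_push:   (stack_dim, hidden) illustrative weights producing the pushed value
    """
    # Soft action distribution over push / pop / no-op, predicted from h_t.
    action = F.softmax(W_action @ h_t, dim=0)
    push, pop, noop = action[0], action[1], action[2]

    # Candidate value that a hard push would place on top of the stack.
    new_top = torch.sigmoid(W_push @ h_t)

    # Stack contents after a hard push (everything shifts down one slot) ...
    pushed = torch.cat([new_top.unsqueeze(0), stack[:-1]], dim=0)
    # ... and after a hard pop (everything shifts up; the bottom slot is zeroed).
    popped = torch.cat([stack[1:], torch.zeros_like(stack[:1])], dim=0)

    # The continuous stack is the expectation over the three discrete outcomes.
    return push * pushed + pop * popped + noop * stack

# Toy usage: depth-8 stack of 16-dimensional values, 32-dimensional RNN state.
depth, stack_dim, hidden = 8, 16, 32
stack = torch.zeros(depth, stack_dim)
h_t = torch.randn(hidden)
stack = soft_stack_update(stack, h_t, torch.randn(3, hidden), torch.randn(stack_dim, hidden))
</syntaxhighlight>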

1. Introduction

...

* We compare how a recurrent neural network uses a stack memory, a sequential memory cell (i.e., an LSTM memory cell), and a random access memory (i.e., an attention mechanism) for language modeling. Experiments on the Penn Treebank and Wikitext-2 datasets (§3.2) show that both the stack model and the attention-based model outperform the LSTM model with a comparable (or even larger) number of parameters, and that the stack model eliminates the need to tune window size to achieve the best perplexity.

2. Model

Random access memory. One common approach to retrieve information from the distant past more reliably is to augment the model with a random access memory block via an attention-based method. In this model, we consider the previous $K$ states as the memory block, and construct a memory vector $\mathbf{m}_t$ by a weighted combination of these states:

[math]\displaystyle{ \mathbf{m}_t = \sum_{i=t-K}^{t-1} a_i\mathbf{h}_i }[/math], where [math]\displaystyle{ a_i \propto \exp\left(\mathbf{w}_{m,i}\mathbf{h}_i + \mathbf{w}_{m,h} \mathbf{h}_t\right) }[/math]

Such a method can be improved further by partitioning $\mathbf{h}$ into key, value, and predict subvectors (Daniluk et al., 2017).
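As a concrete illustration of the weighted-combination memory above, here is a minimal sketch in PyTorch. The additive scoring function (with illustrative weights W_key, W_query, and v) is a stand-in for the learned attention parameters written $\mathbf{w}_{m,i}$ and $\mathbf{w}_{m,h}$ in the equation, not the paper's exact parameterization, and the key/value/predict partitioning of Daniluk et al. (2017) is not included.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def attention_memory(prev_states, h_t, W_key, W_query, v):
    """Memory vector m_t as an attention-weighted combination of the
    previous K hidden states (illustrative additive scoring).

    prev_states: (K, hidden)  the memory block h_{t-K}, ..., h_{t-1}
    h_t:         (hidden,)    the current RNN state
    """
    # Per-position scores depend on both the stored state h_i and the current state h_t.
    scores = torch.tanh(prev_states @ W_key.T + h_t @ W_query.T) @ v   # (K,)
    a = F.softmax(scores, dim=0)                                       # attention weights a_i
    m_t = a @ prev_states                                              # weighted combination of states
    return m_t, a

# Toy usage: attend over the K = 5 most recent states of a 16-dimensional RNN.
K, hidden = 5, 16
prev_states, h_t = torch.randn(K, hidden), torch.randn(hidden)
W_key, W_query, v = torch.randn(hidden, hidden), torch.randn(hidden, hidden), torch.randn(hidden)
m_t, a = attention_memory(prev_states, h_t, W_key, W_query, v)
</syntaxhighlight>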

3. Experiments

4. Conclusion

Acknowledgements

References


Author: Chris Dyer; Phil Blunsom; Wang Ling; Dani Yogatama; Yishu Miao; Gabor Melis; Adhiguna Kuncoro
Title: Memory Architectures in Recurrent Neural Network Language Models
Year: 2018