# 2015 A Neural Attention Model for Abstractive Sentence Summarization

Alexander M. Rush, Sumit Chopra, and Jason Weston (2015)

## Quotes

### Abstract

Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.


### 1 Introduction

Summarization is an important challenge of natural language understanding. The aim is to produce a condensed representation of an input text that captures the core meaning of the original. Most successful summarization systems utilize extractive approaches that crop out and stitch together portions of the text to produce a condensed version. In contrast, abstractive summarization attempts to produce a bottom-up summary, aspects of which may not appear as part of the original.

We focus on the task of sentence-level summarization. While much work on this task has looked at deletion-based sentence compression techniques (Knight and Marcu, 2002, among many others), studies of human summarizers show that it is common to apply various other operations while condensing, such as paraphrasing, generalization, and reordering (Jing, 2002). Past work has modeled this abstractive summarization problem either using linguistically-inspired constraints (Dorr et al., 2003; Zajic et al., 2004) or with syntactic transformations of the input text (Cohn and Lapata, 2008; Woodsend et al., 2010). These approaches are described in more detail in Section 6. We instead explore a fully data-driven approach for generating abstractive summaries. Inspired by the recent success of neural machine translation, we combine a neural language model with a contextual input encoder. Our encoder is modeled off of the attention-based encoder of Bahdanau et al. (2014) in that it learns a latent soft alignment over the input text to help inform the summary (as shown in Figure 1). Crucially, both the encoder and the generation model are trained jointly on the sentence summarization task. The model is described in detail in Section 3. Our model also incorporates a beam-search decoder as well as additional features to model extractive elements; these aspects are discussed in Sections 4 and 5.

This approach to summarization, which we call Attention-Based Summarization (ABS), incorporates less linguistic structure than comparable abstractive summarization approaches, but can easily scale to train on a large amount of data. Since our system makes no assumptions about the vocabulary of the generated summary it can be trained directly on any document-summary pair[1]. This allows us to train a summarization model for headline-generation on a corpus of article pairs from Gigaword (Graff et al., 2003) consisting of around 4 million articles. An example of generation is given in Figure 2, and we discuss the details of this task in Section 7.

Figure 2: Example input sentence and generated summary.
Input $\left(\mathbf{x}_1, \cdots, \mathbf{x}_{18}\right)$, first sentence of article: *russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism*.
Output $\left(\mathbf{y}_1, \cdots, \mathbf{y}_8\right)$, generated headline: *russia calls for joint front against terrorism* $\quad\Leftarrow\quad$ g(terrorism, $\mathbf{x}$, for, joint, front, against)

To test the effectiveness of this approach we run extensive comparisons with multiple abstractive and extractive baselines, including traditional syntax-based systems, integer linear program constrained systems, information-retrieval style approaches, as well as statistical phrase-based machine translation. Section 8 describes the results of these experiments. Our approach outperforms a machine translation system trained on the same large-scale dataset and yields a large improvement over the highest scoring system in the DUC-2004 competition.

### 2 Background

We begin by defining the sentence summarization task. Given an input sentence, the goal is to produce a condensed summary. Let the input consist of a sequence of $M$ words $\mathbf{x}_1,\cdots, \mathbf{x}_M$ coming from a fixed vocabulary $\mathcal{V}$ of size $\vert\mathcal{V}\vert= V$. We will represent each word as an indicator vector $\mathbf{x}_i \in \{0,1\}^V$ for $i \in \{1,\cdots, M\}$, sentences as a sequence of indicators, and $\mathcal{X}$ as the set of possible inputs. Furthermore define the notation $\mathbf{x}_{[i,j,k]}$ to indicate the sub-sequence of elements $i, j, k$.

A summarizer takes $\mathbf{x}$ as input and outputs a shortened sentence $\mathbf{y}$ of length $N < M$. We will assume that the words in the summary also come from the same vocabulary $\mathcal{V}$ and that the output is a sequence $\mathbf{y}_1, \cdots , \mathbf{y}_N$. Note that in contrast to related tasks, like machine translation, we will assume that the output length $N$ is fixed, and that the system knows the length of the summary before generation[2].
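
To make the notation concrete, here is a minimal Python sketch of indicator vectors and the sub-sequence notation; the toy vocabulary and sentence are hypothetical, not from the paper.

```python
import numpy as np

vocab = {"russia": 0, "calls": 1, "for": 2, "joint": 3, "front": 4}
V = len(vocab)  # vocabulary size |V|

def one_hot(word):
    """Indicator vector x_i in {0,1}^V for a single word."""
    vec = np.zeros(V, dtype=np.int8)
    vec[vocab[word]] = 1
    return vec

# A sentence as a sequence of M indicator vectors (an M x V matrix).
x = np.stack([one_hot(w) for w in ["russia", "calls", "for", "joint", "front"]])

def subseq(x, indices):
    """x_[i,j,k]: the sub-sequence of elements i, j, k."""
    return x[list(indices)]

print(subseq(x, [0, 2, 3]).shape)  # (3, 5): three indicator vectors
```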

Next consider the problem of generating summaries. Define the set $\mathcal{Y} \subset \left(\{0, 1\}^V , \cdots , \{0, 1\}^V \right)$ as all possible sentences of length $N$, i.e. for all $i$ and $\mathbf{y} \in \mathcal{Y}$, $\mathbf{y}_i$ is an indicator. We say a system is abstractive if it tries to find the optimal sequence from this set $\mathcal{Y}$,

 $\underset{\mathbf{y}\in \mathcal{Y}}{\arg\max}\; s\left(\mathbf{x}, \mathbf{y}\right)$ (1)

under a scoring function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Contrast this to a fully extractive sentence summary[3] which transfers words from the input:

 $\underset{m\in\{1,\cdots,M\}^N}{\arg\max}\; s\left(\mathbf{x}, \mathbf{x}_{[m_1,\cdots,m_N]}\right)$ (2)

or to the related problem of sentence compression that concentrates on deleting words from the input:

 $\underset{m\in\{1,\cdots,M\}^N,\, m_{i-1} < m_i}{\arg\max}\; s\left(\mathbf{x}, \mathbf{x}_{[m_1,\cdots,m_N]}\right)$ (3)

While abstractive summarization poses a more difficult generation challenge, the lack of hard constraints gives the system more freedom in generation and allows it to fit with a wider range of training data.
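
The difference between the three objectives is only the set being searched over. A toy Python sketch, with a placeholder scorer and brute-force enumeration (neither of which is the paper's method), makes the constraints explicit:

```python
from itertools import product, combinations

x = ["russian", "minister", "calls", "for", "joint", "front"]  # input of length M
vocab = set(x) | {"russia", "against", "terrorism"}            # vocabulary V
N = 3                                                          # fixed summary length

def s(x, y):
    """Placeholder scoring function; a trained model goes here."""
    return 0.0

# Eq. (1), abstractive: any length-N word sequence over the vocabulary.
abstractive = product(vocab, repeat=N)

# Eq. (2), extractive: words transferred from any N input positions.
extractive = (tuple(x[m] for m in ms) for ms in product(range(len(x)), repeat=N))

# Eq. (3), compression: strictly increasing positions, i.e. deletion only.
compression = (tuple(x[m] for m in ms) for ms in combinations(range(len(x)), N))

best = max(abstractive, key=lambda y: s(x, y))
```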

In this work we focus on factored scoring functions, $s$, that take into account a fixed window of previous words:

 $s(\mathbf{x}, \mathbf{y}) \approx \displaystyle \sum^{N-1}_{i=0} g\left(\mathbf{y}_{i+1}, \mathbf{x}, \mathbf{y}_c\right)$, (4)

where we define $\mathbf{y}_c \triangleq \mathbf{y}_{[i-C+1,\cdots,i]}$ for a window of size $C$.

In particular consider the conditional log-probability of a summary given the input, $s\left(\mathbf{x}, \mathbf{y}\right) = \log p\left(\mathbf{y}\vert \mathbf{x}; \theta\right)$. We can write this as:

$\log p\left(\mathbf{y}\vert\mathbf{x}; \theta\right) \approx \displaystyle\sum^{N-1}_{i=0} \log p\left(\mathbf{y}_{i+1}\vert\mathbf{x}, \mathbf{y}_c; \theta\right)$,

where we make a Markov assumption limiting the context to size $C$, and assume that for $i < 1$, $\mathbf{y}_i$ is a special start symbol $\langle S \rangle$.
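
As a sketch, the factored score can be computed by padding the summary with $C$ start symbols and summing local log-probabilities over a sliding window; `log_p` below is a hypothetical stand-in for the model defined in Section 3.

```python
import math

START = "<S>"

def log_p(next_word, x, y_c):
    """Hypothetical local model log p(y_{i+1} | x, y_c; theta)."""
    return math.log(0.5)  # placeholder probability

def score(x, y, C):
    """s(x, y) ~= sum_{i=0}^{N-1} log p(y_{i+1} | x, y_c; theta), Eq. (4)."""
    padded = [START] * C + list(y)       # y_i = <S> for i < 1
    total = 0.0
    for i in range(len(y)):
        y_c = padded[i:i + C]            # context window y_[i-C+1, ..., i]
        total += log_p(y[i], x, y_c)
    return total

print(score(x=["a", "b"], y=["russia", "calls", "for"], C=2))  # 3 * log(0.5)
```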

With this scoring function in mind, our main focus will be on modelling the local conditional distribution: $p\left(\mathbf{y}_{i+1}\vert\mathbf{x}, \mathbf{y}_c; \theta\right)$. The next section defines a parameterization for this distribution; Section 4 returns to the question of generation for factored models; and Section 5 introduces a modified factored scoring function.

### 3 Model

The distribution of interest, $p\left(\mathbf{y}_{i+1}\vert\mathbf{x}, \mathbf{y}_c; \theta\right)$, is a conditional language model based on the input sentence $\mathbf{x}$. Past work on summarization and compression has used a noisy-channel approach to split and independently estimate a language model and a conditional summarization model (Banko et al., 2000; Knight and Marcu, 2002; Daumé III and Marcu, 2002), i.e.,

$\underset{\mathbf{y}}{\mathrm{arg max}} \log p\left(\mathbf{y}\vert\mathbf{x}\right)= \underset{\mathbf{y}}{\mathrm{arg max}} \log p\left(\mathbf{y}\right)p\left(\mathbf{x}\vert\mathbf{y}\right)$

where $p(\mathbf{y})$ and $p(\mathbf{x}\vert\mathbf{y})$ are estimated separately. Here we instead follow work in neural machine translation and directly parameterize the original distribution as a neural network. The network contains both a neural probabilistic language model and an encoder which acts as a conditional summarization model.

#### 3.1 Neural Language Model

The core of our parameterization is a language model for estimating the contextual probability of the next word. The language model is adapted from a standard feed-forward neural network language model (NNLM), particularly the class of NNLMs described by Bengio et al. (2003). The full model is:

\begin{align} p\left(\mathbf{y}_{i+1}\vert\mathbf{y}_c, \mathbf{x}; \theta\right) & \propto \exp\left(\mathbf{V}\mathbf{h} + \mathbf{W}\,\mathrm{enc}\left(\mathbf{x}, \mathbf{y}_c\right)\right), \\ \mathbf{\tilde{y}}_c &= \Big[\mathbf{E}\mathbf{y}_{i-C+1}, \cdots, \mathbf{E}\mathbf{y}_i\Big], \\ \mathbf{h} &= \tanh\left(\mathbf{U}\mathbf{\tilde{y}}_c\right). \end{align}
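
A numpy sketch of this NNLM follows. The dimensions are arbitrary, and `enc` is replaced by a simple bag-of-words average as a stand-in for the attention-based encoder the paper defines later in Section 3, so everything outside the three equations above is an assumption for illustration.

```python
import numpy as np

V, D, H, C = 1000, 50, 100, 5        # vocab, embedding, hidden, context sizes
rng = np.random.default_rng(0)

E = rng.normal(0, 0.1, (D, V))       # word embedding matrix
U = rng.normal(0, 0.1, (H, C * D))   # context-to-hidden weights
V_out = rng.normal(0, 0.1, (V, H))   # "V" in the equations (renamed to avoid clash)
W = rng.normal(0, 0.1, (V, H))       # weights on the encoder output
P = rng.normal(0, 0.1, (H, D))       # projection used by the stand-in encoder

def enc(x, y_c):
    """Stand-in encoder: mean input embedding (ignores y_c, unlike attention)."""
    bow = (E @ x.T).mean(axis=1)     # average embedding of the input, shape (D,)
    return P @ bow                   # shape (H,)

def next_word_dist(x, y_c):
    """p(y_{i+1} | y_c, x; theta) proportional to exp(V h + W enc(x, y_c))."""
    y_tilde = np.concatenate([E @ y for y in y_c])  # [E y_{i-C+1}, ..., E y_i]
    h = np.tanh(U @ y_tilde)                        # hidden layer
    logits = V_out @ h + W @ enc(x, y_c)
    p = np.exp(logits - logits.max())               # numerically stable softmax
    return p / p.sum()

# Usage: x is an M x V matrix of indicator vectors, y_c a list of C of them.
x = np.eye(V, dtype=np.int8)[rng.integers(0, V, size=18)]
y_c = [np.eye(V, dtype=np.int8)[i] for i in rng.integers(0, V, size=C)]
print(next_word_dist(x, y_c).sum())  # ~1.0
```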

## References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Michele Banko, Vibhu O Mittal, and Michael J Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 318–325. Association for Computational Linguistics.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, pages 399–429.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 137–144. Association for Computational Linguistics.

Hal Daumé III and Daniel Marcu. 2002. A noisy-channel model for document compression. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 449–456. Association for Computational Linguistics.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop - Volume 5, pages 1–8. Association for Computational Linguistics.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP, pages 1481–1491.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hongyan Jing. 2002. Using hidden markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527–543.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700–1709.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167. Association for Computational Linguistics.

Paul Over, Hoa Dang, and Donna Harman. 2007. DUC in context. Information Processing & Management, 43(6):1506–1520.

Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Kristian Woodsend, Yansong Feng, and Mirella Lapata. 2010. Generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 513–523. Association for Computational Linguistics.

Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 1015–1024. Association for Computational Linguistics.

Omar Zaidan. 2009. Z-mert: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In Proceedings of the HLT-NAACL 2004 Document Understanding Workshop, Boston, pages 112–119.
