2017 Lecture10NeuralMachineTranslation

From GM-RKB

Subject Headings: Neural seq2seq, Neural MT, Neural Models with Attention, Manning-Socher Neural Machine Translation System, WMT-14 SMT Shared Task, WMT-15 SMT Shared Task.

Notes

Cited By

2017

Quotes

Lecture Plan
Going forwards and backwards
  1. Translation, Machine Translation, Neural Machine Translation
  2. Research highlight: Google’s new NMT
  3. Sequence models with attention
  4. Sequence model decoders

The classic test of language understanding!

Both language analysis & generation

Big MT needs … for humanity … and commerce

Translation is a US$40 billion a year industry


Huge in Europe, growing in Asia


Large social/government/military as well as commercial needs


Huge commercial use

Google translates over 100 billion words a day.

Facebook in 2016 rolled out new homegrown MT: "When we turned [MT] off for some people, they went nuts!"

eBay uses MT to enable cross-border trade.

http://www.commonsenseadvisory.com/AbstractView.aspx?ArticleID=36540
https://googleblog.blogspot.com/2016/04/ten-years-of-google-translate.html
https://techcrunch.com/2016/05/23/facebook-translation/


What is Neural MT (NMT)?

Neural Machine Translation is the approach of modeling the entire MT process via one big artificial neural network*

  • But sometimes we compromise this goal a little



Neural encoder-decoder architectures

[Figure: encoder-decoder pipeline – input text → Encoder → intermediate vector representation (e.g. −0.2, −0.1, 0.1, 0.4, −0.3, 1.1) → Decoder → translated text]
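As a rough sketch of this pipeline (the `encoder` and `decoder` callables here are placeholders, not the lecture's actual system), the whole translation process is a single differentiable network trained end-to-end:

```python
def translate(source_tokens, encoder, decoder):
    """Encode the source sentence into a vector, then decode a translation."""
    # Encoder: compress the source sentence into an intermediate representation Y
    # (e.g. the final hidden state of a recurrent network).
    Y = encoder(source_tokens)
    # Decoder: a conditional language model that emits target words until <eos>.
    return decoder(Y)
```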

[Allen 1987 IEEE 1st ICNN]


3,310 En-Es pairs constructed from 31 En and 40 Es words, max 10/11-word sentences; 33 used as test set

The grandfather offered the little girl a book ➔ El abuelo le ofreció un libro a la niña pequeña

Binary encoding of words – 50 inputs, 66 outputs; 1 or 3 hidden 150-unit layers. Ave WER: 1.3 words


[Chrisman 1992 Connection Science]

Dual-ported RAAM architecture [Pollack 1990 Artificial Intelligence] applied to corpus of 216 parallel pairs of simple En-Es sentences:

You are not angry ⬌ Usted no está furioso

Split 50/50 as train/test, 75% of sentences correctly translated!

Modern Sequence Models for NMT

[Sutskever et al. 2014, cf. Bahdanau et al. 2014, et seq.]

[Figure: encoder-decoder translating "I am a student" into "Je suis étudiant"; the decoder is fed its previously generated word at each step]

[Sutskever et al. 2014, cf. Bahdanau et al. 2014, et seq.]

[Figure: a deep recurrent encoder-decoder shown as stacks of hidden-state vectors. The encoder builds up the sentence meaning of the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>"; the decoder then generates the translation "The protests escalated over the weekend <EOS>" one word at a time.]

A deep recurrent neural network

Conditional Recurrent Language Model

[Figure: the decoder is a conditional recurrent language model that feeds in the last generated word at each step; the encoder maps "The cat sat on the mat." to a summary vector Y, from which the decoder produces "Le chat assis sur le tapis."]

Recurrent Neural Network Encoder

[Figure: encoder hidden states h0, h1, h2, h3, …, h7 computed over the source sentence "Le chat assis …"]

  • Read a source sentence one symbol at a time.
  • The last hidden state Y summarizes the entire source sentence.
  • Any recurrent activation function can be used:
      • Hyperbolic tangent (tanh)
      • Gated recurrent unit [Cho et al., 2014]
      • Long short-term memory [Sutskever et al., 2014]
      • Convolutional network [Kalchbrenner & Blunsom, 2013]
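A minimal numpy sketch of such an encoder, assuming a plain tanh RNN (the systems discussed in the lecture use GRU or LSTM cells instead); all parameter names here are illustrative:

```python
import numpy as np

def rnn_encoder(source_ids, E, W, U, b):
    """Read a source sentence one symbol at a time; the last hidden state
    summarizes the entire sentence.
    E: embeddings (vocab x d_emb), W: (d_hid x d_emb), U: (d_hid x d_hid)."""
    h = np.zeros(U.shape[0])
    for t in source_ids:                    # one symbol at a time
        x = E[t]                            # embed the current word
        h = np.tanh(W @ x + U @ h + b)      # recurrent update
    return h                                # Y = final hidden state

# Toy usage: vocabulary of 10 words, 4-dim embeddings, 5-dim hidden state.
rng = np.random.default_rng(0)
E, W, U, b = rng.normal(size=(10, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), np.zeros(5)
Y = rnn_encoder([3, 1, 7, 2], E, W, U, b)
```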
Decoder: Recurrent Language Model

[Figure: decoder hidden states z0, z1, z2, z3, … conditioned on Y = h7, generating "The cat sat on …" while being fed "The cat sat …"]

  • Usual recurrent language model, except:
      • Transition: $z_t = f(z_{t-1}, x_t, Y)$
      • Backpropagation: $\sum_t \partial z_t / \partial Y$
  • Same learning strategy as usual: MLE with SGD

$$\mathcal{L}(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p\left(x_t^n \mid x_1^n, \ldots, x_{t-1}^n, Y\right)$$
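A sketch of the decoder's contribution to this objective for one sentence pair, again with a plain tanh transition and illustrative parameter names; averaging this quantity over the N training pairs gives the objective above:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decoder_log_likelihood(target_ids, Y, E, W, U, C, b, V_out, bos_id=0):
    """Conditional recurrent LM with transition z_t = f(z_{t-1}, x_t, Y).
    Returns sum_t log p(x_t | x_<t, Y) for one target sentence."""
    z = np.zeros(U.shape[0])
    prev = bos_id                                # assumed beginning-of-sentence symbol
    total = 0.0
    for t in target_ids:
        x = E[prev]
        z = np.tanh(W @ x + U @ z + C @ Y + b)   # transition also conditions on Y
        p = softmax(V_out @ z)                   # distribution over the target vocabulary
        total += np.log(p[t])                    # log-probability of the observed word
        prev = t
    return total
```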

Progress in Machine Translation

[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]

[Chart: cased BLEU on WMT En-De newstest2013 by year (2013-2016), comparing phrase-based SMT, syntax-based SMT, and neural MT.]

From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

Neural MT went from a fringe research activity in 2014 to the widely adopted leading way to do MT in 2016. Amazing!

Four big wins of Neural MT

  1. End-to-end training:
     All parameters are simultaneously optimized to minimize a loss function on the network's output
  2. Distributed representations share strength:
     Better exploitation of word and phrase similarities
  3. Better exploitation of context:
     NMT can use a much bigger context – both source and partial target text – to translate more accurately
  4. More fluent text generation:
     Deep learning text generation is much higher quality

What wasn’t on that list?

  • Black box component models for reordering, transliteration, etc.
  • Explicit use of syntactic or semantic structures
  • Explicit use of discourse structure, anaphora, etc.
Statistical/Neural Machine Translation

A marvelous use of big data but…

1519年600名西班牙人在墨西哥登陆,去征服几百万人口的阿兹特克帝国,初次交锋他们损兵三分之二。

Reference translation: In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.

translate.google.com (2009): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.

translate.google.com (2011): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the initial loss of soldiers, two thirds of their encounters.

translate.google.com (2013): 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.

translate.google.com (2014/15/16): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.

translate.google.com (2017): In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec empire, the first confrontation they killed two-thirds.
Adoption!!! NMT aggressively rolled out by industry!

  • 2016/02, Microsoft launches deep neural network MT running offline on Android/iOS. [Link to blog]
  • 2016/08, Systran launches purely NMT model [Link to press release]
  • 2016/09, Google launches NMT [Link to blog post]
      • With much more hype and gross overclaims of equaling human translation quality
      • Great New York Times Magazine feature
      • Paper on the research: https://arxiv.org/abs/1611.04558
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean

Presented by: Emma Peng
State-of-the-art: Neural Machine Translation (NMT)
Multilingual NMT? Previously …

  • Multiple Encoders → Multiple Decoders [1]
  • Shared Encoder → Multiple Decoders [2]
  • Multiple Encoders → Shared Decoder [3]

[Figure: separate per-language-pair encoders and decoders versus shared encoder/decoder components]

Google's approach and its benefits:
  • Simplicity: single model
  • Low-resource language improvements
  • Zero-shot translation

Artificial token at the beginning of the input sentence to indicate the target language

Add <2es> to indicate that Spanish is the target language
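A minimal sketch of this trick; the helper name is illustrative:

```python
# Prepend an artificial token naming the target language, then feed the
# sentence to the single shared multilingual model.
def add_target_token(source_sentence, target_lang="es"):
    return f"<2{target_lang}> {source_sentence}"

add_target_token("How are you?")         # -> "<2es> How are you?"
add_target_token("How are you?", "de")   # -> "<2de> How are you?"
```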


  • WMT’14:
      • Comparable performance: English → French
      • State-of-the-art: English → German, French → English
  • WMT’15:
      • State-of-the-art: German → English

Zero-shot translation:
  • Train:
      • Portuguese → English, English → Spanish (Model 1)
      • Or, English ←→ {Portuguese, Spanish} (Model 2)
  • Test:
      • Portuguese → Spanish. Zero-shot!

Thank you!

Introducing Attention: Vanilla seq2seq & long sentences

[Figure: vanilla seq2seq translating "I am a student" into "Je suis étudiant" through a single fixed-size vector]

Problem: fixed-dimensional representation Y

Attention Mechanism

  • Started in computer vision! [Larochelle & Hinton, 2010], [Denil, Bazzani, Larochelle, Freitas, 2012]
  • Solution: random access memory over a pool of source states
      • Retrieve as needed.
Word alignments

Phrase-based SMT aligned words in a preprocessing step, usually using EM.

[Figure: word alignment matrix between "Le reste appartenait aux autochtones" and "The balance was the territory of the aboriginal people"]

Learning both translation & alignment

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR’15.
Attention Mechanism

[Figure: while generating "suis" after "Je" for the source "I am a student", an attention layer over the source hidden states produces a context vector that feeds the decoder; simplified version of (Bahdanau et al., 2015)]

  • Scoring: compare the target hidden state with each source hidden state (e.g. scores 3, 5, 1, 1).
  • Normalization: convert the scores into alignment weights (e.g. 0.3, 0.5, 0.1, 0.1).
  • Context: build the context vector as a weighted average of the source states.
  • Hidden state: compute the next hidden state from the context vector and the decoder state.

A minimal numeric sketch of these four steps follows below.
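Here is a sketch of one attention step in plain numpy, assuming a single target hidden state and a matrix of source hidden states. Dot-product scoring is used for simplicity, and the combination step is a parameter-free stand-in; the cited papers apply learned projections here.

```python
import numpy as np

def attention_step(target_hidden, source_states):
    """One attention step mirroring the slides: score, normalize,
    build the context vector, then combine it with the decoder state."""
    scores = source_states @ target_hidden     # 1. compare states (e.g. 3, 5, 1, 1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # 2. normalize into alignment weights
    context = weights @ source_states          # 3. weighted average of source states
    # 4. combine context and decoder state; a learned projection would
    #    normally map this back to the hidden size.
    combined = np.tanh(np.concatenate([context, target_hidden]))
    return combined, weights

# Toy usage: 4 source positions, 5-dimensional hidden states.
rng = np.random.default_rng(1)
h_t = rng.normal(size=5)
H_src = rng.normal(size=(4, 5))
combined, weights = attention_step(h_t, H_src)   # weights sum to 1
```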
Simplified mechanism & more functions

Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP’15.

Bilinear form: well-adopted.
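For reference, the scoring functions from the cited Luong, Pham & Manning (EMNLP’15) paper, with $h_t$ the current target hidden state and $\bar h_s$ a source hidden state; the bilinear ("general") form is the one called well-adopted above:

$$\mathrm{score}(h_t, \bar h_s) = \begin{cases} h_t^\top \bar h_s & \text{(dot)} \\ h_t^\top W_a \bar h_s & \text{(general / bilinear)} \\ v_a^\top \tanh\!\big(W_a [h_t; \bar h_s]\big) & \text{(concat)} \end{cases}$$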
Avoid focusing on everything at each time

  • Global: all source states.
  • Local: subset of source states.
  • Potential for long sequences!

Thang Luong, Hieu Pham, and Chris Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP’15.
Attention

[Chart: BLEU versus sentence length (10-70 words), attention versus no attention. Legend: ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best system (BLEU 23.0); WMT’14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6).]
source: Orlando Bloom and Miranda Kerr still love each other
human: Orlando Bloom und Miranda Kerr lieben sich noch immer
+attn: Orlando Bloom und Miranda Kerr lieben einander noch immer .
base: Orlando Bloom und Lucas Miranda lieben einander noch immer .

Translates names correctly.
source: We're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security, said Roger Dow, CEO of the U.S. Travel Association.
human: Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht , sagte Roger Dow , CEO der U.S. Travel Association .
+attn: Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist , sagte Roger Dow , CEO der US - die .
base: Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit , sagte Roger Cameron , CEO der US - <unk> .

Translates a doubly-negated phrase correctly.
More Attention! The idea of coverage

Caption generation:

[Figure: image-captioning attention maps over image patches]

How to not miss an important image patch?

Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML’15

Doubly stochastic attention:
  • Per image patch, sum across caption words ≃ 1
  • Sum to 1 in both dimensions
  • Coverage sets existed long ago in SMT!
Extending attention with linguistic ideas previously used in alignment models

  • [Tu, Lu, Liu, Liu, Li, ACL’16]: NMT model with coverage-based attention
  • [Cohn, Hoang, Vymolova, Yao, Dyer, Haffari, NAACL’16]: more substantive models of attention using: position (IBM2) + Markov (HMM) + fertility (IBM3-5) + alignment symmetry (BerkeleyAligner)

[Figure: per-source-word fertility in the attention model]
Sequence Model Decoders

Decoding (0) – Exhaustive Search

  • Simple and exact decoding algorithm:
      • Score each and every possible translation
      • Pick the best one
  • DO NOT EVEN THINK of TRYING IT OUT!*

* Perhaps with a quantum computer and quantum annealing?
Decoding (1) – Ancestral Sampling

  • Repeat: draw one symbol at a time from $\tilde{x}_t \sim p(x_t \mid x_{t-1}, \ldots, x_1, Y)$
  • Until $\tilde{x}_t = \langle\text{eos}\rangle$

  • Pros:
      1. Efficient and unbiased (asymptotically exact)
  • Cons:
      1. High variance
      2. Pretty inefficient
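A sketch of ancestral sampling, assuming a hypothetical `step(prev_id, state, Y)` helper that returns the next-word distribution and the updated decoder state (e.g. wrapping the conditional recurrent LM sketched earlier); drawing several independent samples and keeping the best-scoring one is what makes the strategy unbiased but high-variance:

```python
def ancestral_sampling(Y, step, eos_id, rng, max_len=100):
    """Decoding (1): draw one symbol at a time from p(x_t | x_<t, Y) until <eos>.
    `rng` is e.g. numpy.random.default_rng()."""
    state, prev, out = None, None, []
    for _ in range(max_len):
        probs, state = step(prev, state, Y)
        prev = int(rng.choice(len(probs), p=probs))   # sample, don't take the max
        if prev == eos_id:
            break
        out.append(prev)
    return out
```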
Decoding (2) – Greedy Decoding

  • Efficient, but heavily suboptimal search
  • Pick the most likely symbol each time: $\tilde{x}_t = \arg\max_x \log p(x \mid x_{<t}, Y)$
  • Until $\tilde{x}_t = \langle\text{eos}\rangle$

  • Pros:
      1. Super-efficient (both computation and memory)
  • Cons:
      1. Heavily suboptimal
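The greedy variant only changes the selection rule, using the same assumed `step` helper as above:

```python
def greedy_decoding(Y, step, eos_id, max_len=100):
    """Decoding (2): pick the most likely symbol at every position."""
    state, prev, out = None, None, []
    for _ in range(max_len):
        probs, state = step(prev, state, Y)
        prev = int(probs.argmax())        # argmax instead of sampling
        if prev == eos_id:
            break
        out.append(prev)
    return out
```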
Decoding (3) – Beam Search

  • Much better quality, but not very efficient
  • Maintain K hypotheses at a time:
    $$H_{t-1} = \left\{ (\tilde{x}_1^1, \tilde{x}_2^1, \ldots, \tilde{x}_{t-1}^1), (\tilde{x}_1^2, \tilde{x}_2^2, \ldots, \tilde{x}_{t-1}^2), \ldots, (\tilde{x}_1^K, \tilde{x}_2^K, \ldots, \tilde{x}_{t-1}^K) \right\}$$
  • Expand each hypothesis:
    $$\tilde{H}_t^k = \left\{ (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_1), (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_2), \ldots, (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_{|V|}) \right\}$$
  • Pick the top-K hypotheses from the union:
    $$H_t = \bigcup_{k=1}^{K} B_k, \quad \text{where } B_k = \arg\max_{\tilde{X} \in A_k} \log p(\tilde{X} \mid Y), \; A_k = A_{k-1} - B_{k-1}, \; A_1 = \bigcup_{k'=1}^{K} \tilde{H}_t^{k'}$$
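A compact beam-search sketch with the same assumed `step` helper; as a common shortcut it expands each hypothesis with only its K best next symbols rather than the full vocabulary described above:

```python
import numpy as np

def beam_search(Y, step, eos_id, K=5, max_len=100):
    """Decoding (3): keep the K best partial hypotheses at every time step.
    Each hypothesis is (tokens, decoder state, cumulative log-probability)."""
    beams = [([], None, 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, state, score in beams:
            prev = tokens[-1] if tokens else None
            probs, new_state = step(prev, state, Y)
            for v in np.argsort(probs)[-K:]:          # K best next symbols
                candidates.append((tokens + [int(v)], new_state,
                                   score + float(np.log(probs[v]))))
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = []
        for cand in candidates[:K]:                   # keep the top-K overall
            (finished if cand[0][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[2])
    return best[0]          # may end with <eos> if the hypothesis finished
```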
Decoding (3) – Beam Search

  • Asymptotically exact as K → ∞
  • But, not necessarily monotonic improvement w.r.t. K
  • K should be selected to maximize the translation quality on a validation set

En-Cz: 12M training sentence pairs

| Strategy           | Chains | Valid NLL | Valid BLEU | Test NLL | Test BLEU |
|--------------------|--------|-----------|------------|----------|-----------|
| Ancestral Sampling | 50     | 22.98     | 15.64      | 26.25    | 16.76     |
| Greedy Decoding    | -      | 27.88     | 15.50      | 26.49    | 16.66     |
| Beamsearch         | 5      | 20.18     | 17.03      | 22.81    | 18.56     |
| Beamsearch         | 10     | 19.92     | 17.13      | 22.44    | 18.59     |

[Cho, arXiv 2016]

  • Greedy Search
      • Computationally efficient
      • Not great quality
  • Beam Search
      • Computationally expensive
      • Not easy to parallelize
      • Much better quality
  • Beam search with a small beam is the de facto standard in NMT

References

Christopher D. Manning and Richard Socher (2017). “Lecture 10 - Neural Machine Translation and Models with Attention.”