2017 Lecture11FurtherTopicsinNeuralM

From GM-RKB

Subject Headings: LSTM, GRU.

Notes

Cited By

2017

Quotes

Lecture Plan: Going forwards and backwards

  1. A final look at gated recurrent units like GRUs/LSTMs
  2. Research highlight: Lip reading sentences in the wild
  3. Machine translation evaluation
  4. Dealing with the large output vocabulary
  5. Sub-word and character-based models

1. How Gated Units Fix Things – Backpropagation through Time

Intuitively, what happens with RNNs?

1. Measure the influence of a small perturbation g of ht on log p(xt+n|x<t+n):

∂ log p(xt+n|x<t+n) / ∂g = ∂ log p(xt+n|x<t+n) / ∂ht+n · ∂ht+n/∂ht+n-1 ··· ∂ht+1/∂ht · ∂ht/∂g

2. How does the perturbation at t affect p(xt+n|x<t+n)?

Backpropagation through Time

Vanishing gradient is super-problematic

  • When we only observe the norm

‖∂ht+n/∂ht‖ = ‖ ∏ U⊤ · ∂ tanh(ai)/∂ai ‖

  • We cannot tell whether
    1. No dependency between t and t+n in data, or
    2. Wrong configuration of parameters (the vanishing gradient condition):

emax(U) < 1 / maxx tanh′(x)

(emax(U) denotes the largest eigenvalue of U)

  • Is the problem with the naïve transition function?
f (ht-1, xt) = tanh(W [xt] + Uht-1 + b)
  • With it, the temporal derivative is

∂ht+1/∂ht = U⊤ · ∂ tanh(a)/∂a
  • It implies that the error must backpropagate through all the intermediate nodes:
  • Perhaps we can create shortcut connections.
  • Perhaps we can create adaptive shortcut connections.
  • Candidate Update h̃t = tanh(W [xt] + Uht-1 + b)
  • Update gate ut = σ(Wu [xt] + Uuht-1 + bu)
  • Let the net prune unnecessary connections
  • Candidate Update

h̃t = tanh(W [xt] + U (rt ⊙ ht-1) + b)

  • Reset gate rt = σ(Wr [xt] + Urht-1 + br)
  • Update gate ut = σ(Wu [xt] + Uuht-1 + bu)

Gated Recurrent Unit tanh-RNN …

Gated Recurrent Unit

GRU …

Gated recurrent units are much more realistic! Note that there is some overlap in ideas with attention

Two most widely used gated recurrent units

Gated Recurrent Unit

[Cho et al., EMNLP 2014; Chung, Gulcehre, Cho & Bengio, DLUFL 2014]

ht = ut ⊙ h̃t + (1 − ut) ⊙ ht-1
h̃t = tanh(W [xt] + U (rt ⊙ ht-1) + b)
ut = σ(Wu [xt] + Uuht-1 + bu)
rt = σ(Wr [xt] + Urht-1 + br)

Long Short-Term Memory

[Hochreiter & Schmidhuber, NC1999; Gers, Thesis 2001]

ht = ot ⊙ tanh(ct)
ct = ft ⊙ ct-1 + it ⊙ c̃t
c̃t = tanh(Wc [xt] + Ucht-1 + bc)
ot = σ(Wo [xt] + Uoht-1 + bo)
it = σ(Wi [xt] + Uiht-1 + bi)
ft = σ(Wf [xt] + Ufht-1 + bf)
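The gate equations above can be checked with a short NumPy sketch (the vector dimensions, parameter names, and dictionary-of-matrices layout are my own choices; ⊙ is elementwise multiplication, written `*` here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step, following the slide's equations."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])            # reset gate
    u = sigmoid(p["Wu"] @ x + p["Uu"] @ h_prev + p["bu"])            # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev) + p["b"])   # candidate
    return u * h_tilde + (1 - u) * h_prev                            # h_t

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: gates i, f, o plus candidate cell c~."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde   # direct, linear path from c_{t-1} to c_t
    h = o * np.tanh(c)
    return h, c
```

Note how the cell update `c = f * c_prev + i * c_tilde` is additive: this is the direct linear connection between ct and ct−1 highlighted below.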
The LSTM

The LSTM gates all operations so stuff can be forgotten/ignored rather than it all being crammed on top of everything else

The non-linear update for the next time step is just like an RNN


LSTM

This part is the secret! (Of other recent things like ResNets too!) Rather than multiplying, we get ct by adding the non-linear stuff and ct−1! There is a direct, linear connection between ct and ct−1.


  1. Use an LSTM or GRU: it makes your life so much simpler!
  2. Initialize recurrent matrices to be orthogonal
  3. Initialize other matrices with a sensible (small!) scale
  4. Initialize forget gate bias to 1: default to remembering
  5. Use adaptive learning rate algorithms: Adam, AdaDelta, …
  6. Clip the norm of the gradient: 1–5 seems to be a reasonable threshold when used together with Adam or AdaDelta.
  7. Either only dropout vertically or learn how to do it right
  8. Be patient!
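Tips 2, 3, 4, and 6 are mechanical enough to sketch directly; a minimal NumPy version (function names and the 0.01 scale are illustrative choices, not from the slides):

```python
import numpy as np

def orthogonal_init(n, rng):
    """Tip 2: initialize recurrent (square) matrices to be orthogonal."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def small_init(shape, rng, scale=0.01):
    """Tip 3: initialize other matrices with a sensible (small!) scale."""
    return rng.standard_normal(shape) * scale

def clip_grad_norm(grads, max_norm=5.0):
    """Tip 6: clip the global gradient norm; 1-5 works well with Adam/AdaDelta."""
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

# Tip 4: start the forget-gate bias at 1 so the LSTM defaults to remembering.
forget_bias = np.ones(10)
```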

[Saxe et al., ICLR2014; Ba, Kingma, ICLR2015; Zeiler, arXiv2012; Pascanu et al., ICML2013]

Ensembling
  • Train 8–10 nets and average their predictions
  • It’s easy to do and usually gives good gains!


Ensemble of Conditional Recurrent LM
  • Step-wise Ensemble: p(xt^ens | x<t^ens, Y) = ⊕m=1..M p(xt^m | x<t^m, Y)
  • Ensemble operator implementations
    1. Majority voting scheme (OR):

⊕m=1..M pm = (1/M) ∑m=1..M pm

    2. Consensus building scheme (AND):

⊕m=1..M pm = ( ∏m=1..M pm )^(1/M)

[Figure: BLEU scores (15–27) for a single (median) model vs. an OR-ensemble of 8 models on En-De, En-Cs, En-Ru, En-Fi]

[Jung, Cho & Bengio, ACL2016]
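A sketch of the two ensemble operators applied to per-model next-word distributions (NumPy; the renormalization in the AND case is my addition, since a geometric mean of distributions does not sum to 1 on its own):

```python
import numpy as np

def ensemble_or(probs):
    """Majority voting (OR): the arithmetic mean of M model distributions."""
    return np.mean(probs, axis=0)

def ensemble_and(probs):
    """Consensus building (AND): the geometric mean of M model distributions,
    renormalized here so the result is a proper distribution."""
    g = np.prod(probs, axis=0) ** (1.0 / len(probs))
    return g / g.sum()
```

The AND operator only assigns high probability to words that every model likes, while the OR operator lets any single confident model dominate.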

Lip Reading Sentences in the Wild Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman Presented by
Michael Fang


Outline: Model Architecture (Watch, Listen, Attend and Spell); Training Strategies; Dataset; (Professional-Surpassing!) Results

[Figure: the Watch, Listen, and Spell modules cooperating to transcribe "The cat sat"]

Slowly increase the length of training sequences: this makes training converge faster and decreases overfitting.

Randomly sample from the previous prediction instead of the ground truth during training: this makes the training scenario more similar to testing.
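Both training tricks reduce to a few lines; a hedged Python sketch (function names and the cap schedule are illustrative, and real systems anneal `sample_prob` over training rather than fixing it):

```python
import random

def curriculum_phases(pairs, caps):
    """Trick 1: slowly raise the max target length, so early training
    phases only see short sequences. pairs are (source, target) tuples."""
    return [[p for p in pairs if len(p[1]) <= cap] for cap in caps]

def next_decoder_input(gold_tok, model_tok, sample_prob, rng=random):
    """Trick 2: with probability sample_prob feed the model's own previous
    prediction instead of the gold token."""
    return model_tok if rng.random() < sample_prob else gold_tok
```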

3. MT Evaluation

  • Manual (the best!?):
  • SSER (subjective sentence error rate)
  • Correct/Incorrect
  • Adequacy and Fluency (5 or 7 point scales)
  • Error categorization
  • Comparative ranking of translations
  • Testing in an application that uses MT as one sub-component
  • E.g., question answering from foreign language documents
  • May not test many aspects of the translation (e.g., cross-lingual IR)
  • Automatic metric:
  • WER (word error rate) – why problematic?
  • BLEU (Bilingual Evaluation Understudy)
  • N-gram precision (score is between 0 & 1)

– What percent of machine n-grams can be found in the reference translation?
– An n-gram is a sequence of n words
– For each n-gram size, not allowed to match identical portion of reference translation more than once (two MT words "airport" are only correct if there are two reference words "airport"; can't cheat by typing out "the the the the the")

  • Brevity Penalty

– Can’t just type out single word “the” (precision 1.0!)

  • It was thought hard to “game” the metric (i.e., to find a way to change MT output so that BLEU goes up, but quality doesn’t)
  • BLEU is a weighted geometric mean of n-gram precision (is translation in reference?), with a brevity penalty factor added.
  • BLEU4 counts n-grams ≤ length k = 4

pn = (# matched n-grams) / (# MT n-grams)
wn = weight, e.g., w1 = 1, w2 = ½, w3 = ¼, w4 = ⅛
BP = exp(min(0, 1 − (lenref/lenMT)))

log BLEU = log p1 + 0.5 · log p2 + 0.25 · log p3 + 0.125 · log p4 − max(words-in-ref / words-in-MT − 1, 0)

Note: only works at corpus level (zeroes kill it); there’s a smoothed variant for sentence-level
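A single-sentence BLEU-4 sketch in Python with clipped n-gram precision and the brevity penalty. It uses uniform 1/4 weights rather than the decaying weights above, and, as just noted, one zero n-gram precision sends the whole score to 0, which is why real BLEU is computed at corpus level:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-pair BLEU sketch: clipped n-gram precision * brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(c, ref[g]) for g, c in cand.items())  # clipping
        total = max(sum(cand.values()), 1)
        if matched == 0:
            return 0.0          # a single zero precision kills the score
        log_prec += math.log(matched / total) / max_n
    bp = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec)
```

The clipping line is exactly the "can't cheat by typing out 'the the the the the'" rule: each candidate n-gram can only match as many times as it occurs in the reference.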

BLEU in Action

枪手被警方击毙。 (Foreign Original) the gunman was shot to death by the police . (Reference Translation)

#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .

green = 4-gram match (good!) red = word not matched (bad!)

Multiple Reference Translations

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Initial results showed that BLEU predicts human judgments well

[Figure: scatter plot of BLEU scores against Human Judgments; slide from G. Doddington (NIST)]

Automatic evaluation of MT

  • People started optimizing their systems to maximize BLEU score
  • BLEU scores improved rapidly
  • The correlation between BLEU and human judgments of quality went way, way down
  • MT BLEU scores now approach those of human translations but their true quality remains far below human translations
  • Coming up with automatic MT evaluations has become its own research field
  • There are many proposals: TER, METEOR, MaxSim, SEPIA, our own RTE-MT
  • TERpA is a representative good one that handles some word choice variation.
  • MT research requires some automatic metric to allow a rapid development and evaluation cycle.

4. The word generation problem: dealing with a large output vocab

[Figure: word generation in the decoder: from the hidden state after reading "I am a student _", softmax parameters over the full vocabulary |V| give P(Je | …)]

The word generation problem


Softmax computation is expensive.
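Why it is expensive: producing one target word needs a (|V| × d) matrix-vector product before normalization, so the cost grows linearly with vocabulary size. A small NumPy illustration (the FLOP count is the standard back-of-envelope estimate, not from the slides):

```python
import numpy as np

def output_layer_flops(d, vocab_size):
    """Multiply-adds for the hidden-to-logits product at one time step:
    the (|V| x d) matrix-vector multiply dominates for large |V|."""
    return 2 * d * vocab_size

def softmax_over_vocab(h, W_out):
    """Full softmax over the vocabulary: W_out has shape (|V|, d)."""
    z = W_out @ h
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With d = 1000, going from |V| = 50K to |V| = 500K multiplies the per-step output cost by 10, at every time step of every decoded sentence.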

  • Word generation problem
  • If vocabs are modest, e.g., 50K, rare words become <unk>:

The ecotax portico in Pont-de-Buis → The <unk> portico in <unk>
Le portique écotaxe de Pont-de-Buis → Le <unk> <unk> de <unk>

  • Lots of ideas from the neural LM literature!
  • Hierarchical models: tree-structured vocabulary
  • [Morin & Bengio, AISTATS’05], [Mnih & Hinton, NIPS’09].
  • Complex, sensitive to tree structures.
  • Noise-contrastive estimation: binary classification
  • [Mnih & Teh, ICML’12], [Vaswani et al., EMNLP’13].
  • Different noise samples per training example.*

*We’ll mention a simple fix for this!

  • GPU-friendly.
  • Training: a subset of the vocabulary at a time.
  • Testing: smart on the set of possible translations.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. ACL’15.

  • Each time train on a smaller vocab Vʹ ≪ V

  • Partition training data in subsets:
  • Each subset has 𝜏 distinct target words, |Vʹ| = 𝜏.
  • Sequentially select examples: |Vʹ| = 5.

Vʹ = {she, loves, cats, he, likes}

  • Sequentially select examples: |Vʹ| = 5.

Vʹ = {cats, have, tails, dogs, chase}


  • Sequentially select examples: |Vʹ| = 5.

Vʹ = {she, loves, dogs, cats, hate}

  • Practice: |V| = 500K, |Vʹ| = 30K or 50K.
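The sequential-selection idea above can be sketched as a greedy partitioner that closes a subset once adding the next pair would exceed τ distinct target words (my own simplified reading of the Jean et al. scheme, which partitions the corpus offline):

```python
def partition_by_vocab(pairs, tau):
    """Sequentially group (source, target) pairs so each subset's target
    vocabulary V' has at most tau distinct words."""
    subsets, current, vocab = [], [], set()
    for src, tgt in pairs:
        new_vocab = vocab | set(tgt)
        if len(new_vocab) > tau and current:
            subsets.append((current, vocab))   # close the current subset
            current, vocab = [], set(tgt)      # start a new one
        else:
            vocab = new_vocab
        current.append((src, tgt))
    if current:
        subsets.append((current, vocab))
    return subsets
```

Run on the slide's toy example with τ = 5, the first subset's V' comes out as {she, loves, cats, he, likes} and the second as {cats, have, tails, dogs, chase}.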
  • K most frequent words: unigram prob.
  • Candidate target words
  • Kʹ choices per source word. Kʹ = 3.

Testing – Select candidate words

[Figure: the candidate list for decoding = the K most frequent target words + the Kʹ candidate target words of each source word ("She", "loves", "cats")]

  • Produce translations within the candidate list
  • Practice: Kʹ = 10 or 20, K = 15k, 30k, or 50k.
  • “BlackOut: Speeding up Recurrent Neural Network Language Models with very Large Vocabularies” – [Ji, Vishwanathan, Satish, Anderson, Dubey, ICLR’16].
  • Good survey over many techniques.
  • “Simple, Fast Noise Contrastive Estimation for Large RNN Vocabularies” – [Zoph, Vaswani, May, Knight, NAACL’16].
  • Use the same samples per minibatch. GPU efficient.

2nd thought on word generation

  • Scaling softmax is insufficient:
  • New names, new numbers, etc., at test time.
  • But previous MT models can copy words.
  • Recall the Pointer Sentinel Mixture Models (Merity et al. 2017) that Richard mentioned
  • Gulcehre, Ahn, Nallapati, Zhou, Bengio (2016) Pointing the Unknown Words
  • Caution from the Google NMT paper: in principle one can train a “copy model”, but this approach is both unreliable at scale (the attention mechanism is unstable when the network is deep) and copying may not always be the best strategy for rare words (sometimes transliteration is more appropriate)
  • “Copy” mechanisms are not sufficient.
  • Transliteration: Christopher ↦ Kryštof
  • Multi-word alignment: Solar system ↦ Sonnensystem
  • Need to handle large, open vocabulary
  • Rich morphology: nejneobhospodařovávatelnějšímu

(“to the worst farmable one”)

  • Informal spelling: goooooood morning !!!!!
  • Same seq2seq architecture:
  • Use smaller units.
  • [Sennrich, Haddow, Birch, ACL’16a], [Chung, Cho, Bengio, ACL’16].
  • Hybrid architectures:
  • RNN for words + something else for characters.
  • [Costa-Jussà & Fonollosa, ACL’16], [Luong & Manning, ACL’16].
  • A compression algorithm:
  • Most frequent byte pair ↦ a new byte.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. ACL 2016.

  • A word segmentation algorithm:
  • Start with a vocabulary of characters.
  • Most frequent ngram pairs ↦ a new ngram.

(The slides then step through successive merges on an example dictionary and vocabulary; example from Sennrich.)
  • Automatically decide vocabs for NMT

https://github.com/rsennrich/nematus
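The merge loop itself is short; a Python sketch in the spirit of Sennrich et al.'s reference implementation (the end-of-word marker and other practical details from the paper are omitted here):

```python
import re
from collections import Counter

def get_pairs(vocab):
    """Count adjacent symbol pairs across the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Start from characters; repeatedly merge the most frequent pair."""
    vocab = {" ".join(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge(best, vocab)
        merges.append(best)
    return vocab, merges
```

On the toy corpus {low: 5, lower: 2, newest: 6, widest: 3}, the first two merges are ('e', 's') and then ('es', 't'), producing the subword "est" shared by "newest" and "widest".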

  • GNMT uses a variant of this, the wordpiece model, which is generally similar but chooses the pieces with a greedy approximation to maximizing language-model log likelihood


Character-based LSTM

[Figure: a character-based LSTM builds a representation for the rare word "unfortunately" from its characters (u n … l y), alongside the word-level sequence "the bank was closed"]

Hybrid NMT

  • A best-of-both-worlds architecture:
  • Translate mostly at the word level
  • Only go to the character level when needed.
  • More than 2 BLEU improvement over a copy mechanism.

Thang Luong and Chris Manning. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ACL 2016.


Hybrid NMT

  • Word-level (4 layers)
  • End-to-end training, 8-stacking LSTM layers.

[Figure: hybrid architecture: the word-level model produces "un <unk> chat", while a character-level component handles the rare word's characters ("c u t e")]

2-stage Decoding

  • Word-level beam search
  • Char-level beam search for <unk>, initialized with the word hidden states.
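The 2-stage decoder is easy to express once the two beam searches are abstracted away; in this sketch `word_decode` and `char_decode` are stand-ins for the trained word-level and character-level models:

```python
def two_stage_decode(word_decode, char_decode, src):
    """Sketch of 2-stage decoding: run word-level beam search first, then a
    character-level decoder spells out each <unk>. In the real model the char
    decoder is initialized with the word-level hidden state at that position."""
    words, states = word_decode(src)                 # stage 1: word beam search
    return [char_decode(s) if w == "<unk>" else w    # stage 2: fill in <unk>
            for w, s in zip(words, states)]
```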
  • Train on WMT’15 data (12M sentence pairs)
  • newstest2015

[Chart: BLEU on newstest2015, compared against systems using 30x data and a large vocab + copy mechanism]

source  Her 11-year-old daughter , Shani Bart , said it felt a little bit weird
human   Její jedenáctiletá dcera Shani Bartová prozradila , že je to trochu zvláštní
word    Její <unk> dcera <unk> <unk> řekla , že je to trochu divné → Její 11-year-old dcera Shani , řekla , že je to trochu divné
hybrid  Její <unk> dcera , <unk> <unk> , řekla , že je to <unk> <unk> → Její jedenáctiletá dcera , Graham Bart , řekla , že cítí trochu divný

  • Word-based: identity copy fails.
  • Hybrid: correct, 11-year-old – jedenáctiletá.

References

2017. Christopher D. Manning, Richard Socher. Lecture 11 - Further Topics in Neural Machine Translation and Recurrent Models.