Sequence to Sequence Learning with Neural Networks
Contents
The following are my learnings from the paper titled "Sequence to Sequence Learning with Neural Networks":
- History
- 1990s - Statistical Machine Translation (SMT) systems
- Phrase-based MT
- Syntax-based MT
- Semantics-based MT
- Translate smaller pieces and put the lego blocks together
- 2007 - Google Translate was introduced
- 2014 - First NMT paper
- 2016 - Google Translate replaced the SMT method with NMT
- The key idea behind the method is to use LSTMs to form an encoder-decoder architecture, so that an input of any length can be mapped into a fixed-dimensional vector. This fixed-dimensional vector is then used in the decoding phase to generate the translation (a minimal sketch follows below)
- Plain DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
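
A minimal sketch of this encoder-decoder idea, assuming PyTorch; the class name, layer sizes, and single-layer setup are illustrative rather than the paper's exact configuration (only the vocabulary sizes match the ones reported below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder LSTM compresses a variable-length
    source sequence into its final hidden state (the fixed-dimensional vector v);
    the decoder LSTM generates the target sequence conditioned on v."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: only the final (h, c) state is kept -> fixed-size summary v.
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # Decode: teacher forcing with the (shifted) target sequence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=160_000, tgt_vocab=80_000)
logits = model(torch.randint(0, 160_000, (2, 7)), torch.randint(0, 80_000, (2, 5)))
print(logits.shape)  # (2, 5, 80000)
```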
- Model
- Ensemble of 5 deep LSTMs, each with 384 million parameters
- Two different LSTMs - one for the encoder and the other for the decoder
- Deep LSTMs performed better than shallow LSTMs
- The first LSTM encodes the source sentence and the second LSTM decodes the translation
- Beam search in the decoder maintains a small number of partial hypotheses (see the decoding sketch after the equation below)
- Even a beam size of 1 performs well; a beam size of 2 captures most of the benefit
\begin{align} p(y_1, y_2, \ldots, y_{T^{\prime}} | x_1, \ldots, x_T) = \prod^{T^\prime}_{t=1} p(y_t|v, y_1, y_2, \ldots y_{t-1}) \end{align}
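
A small sketch of beam-search decoding over the factorization above; the `step` function, assumed to return the next-token log-probabilities p(y_t | v, y_1..y_{t-1}), and all the names here are hypothetical:

```python
import math
from typing import Callable, List, Tuple

def beam_search(step: Callable[[List[int]], List[Tuple[int, float]]],
                bos: int, eos: int, beam_size: int = 2, max_len: int = 50):
    """Keep the `beam_size` best partial hypotheses (prefix, log-prob);
    extend each with the decoder's next-token log-probabilities and re-prune."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix):          # p(y_t | v, y_1..y_{t-1})
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy decoder: prefers token 1 for a few steps, then ends the sentence.
toy = lambda prefix: [(1, math.log(0.6)), (2, math.log(0.3))] if len(prefix) < 4 else [(0, 0.0)]
print(beam_search(toy, bos=3, eos=0))
```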
- Dataset details
- WMT14 English to French task
- 12 Million sentences
- 348 Million French words
- 304 Million English words
- 160k English (source) vocabulary
- 80k French (target) vocabulary; out-of-vocabulary words were replaced with a special UNK token (see the sketch after this list)
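
A minimal sketch of how a fixed vocabulary with an UNK token for out-of-vocabulary words could be built; the function names and toy corpus are illustrative, not from the paper:

```python
from collections import Counter

def build_vocab(sentences, max_size, unk="<unk>"):
    """Keep only the `max_size` most frequent words; everything else maps to UNK."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {unk: 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk="<unk>"):
    return [vocab.get(w, vocab[unk]) for w in sentence.split()]

corpus = ["the cat sat", "the dog sat", "a rare aardvark"]
vocab = build_vocab(corpus, max_size=5)
print(encode("the aardvark sat quietly", vocab))  # rare/unseen words become UNK id 0
```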
- Training
- LSTMs with 4 layers
- 1000 cells in each layer
- 1000 embedding dimension
- Input vocab = 160000
- Output vocab = 80000
- 384 million parameters in total
- 64 million of these are pure recurrent connections (32M for the encoder LSTM and 32M for the decoder LSTM)
- SGD with learning rate of 0.7
- Batch size of 128
- Hard constraint on the norm of the gradient (scaled down whenever it exceeded a threshold; see the training-loop sketch after this list)
- Sentences within a minibatch were chosen to be of roughly the same length, which sped up training
- About 10 days of training on a machine with 8 GPUs
- SGD without momentum
- Gradient Clipping
- C++ implementation
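
A rough sketch of the training setup described above (plain SGD at learning rate 0.7, minibatches of 128, a hard constraint on the gradient norm), assuming PyTorch and reusing the hypothetical Seq2Seq class from the earlier sketch; the clipping threshold of 5 follows the paper, while the data and the smaller demo vocabularies are stand-ins:

```python
import torch
import torch.nn as nn

# Reusing the Seq2Seq sketch from above; smaller vocabularies here so the demo
# runs quickly (the paper used 160k source / 80k target words).
model = Seq2Seq(src_vocab=10_000, tgt_vocab=10_000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.0)  # plain SGD, no momentum
criterion = nn.CrossEntropyLoss()

def train_step(src_ids, tgt_in, tgt_out):
    """One update on a minibatch with gradient-norm clipping."""
    optimizer.zero_grad()
    logits = model(src_ids, tgt_in)                       # (batch, tgt_len, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    # Hard constraint on the gradient norm: rescale if it exceeds 5.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()

src = torch.randint(0, 10_000, (128, 20))   # batch size of 128 sentence pairs
tgt = torch.randint(0, 10_000, (128, 21))
print(train_step(src, tgt[:, :-1], tgt[:, 1:]))
```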
- Performance
- BLEU score of 34.81 vs Phrase based SMT’s BLEU score of 33.3
- The learned sentence representations cluster by meaning and are sensitive to word order, rather than behaving like a bag of words
- Reversed the order of the words in the source sentences, which introduces many short-term dependencies between source and target and makes optimization easier, as illustrated below
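
A tiny illustration of the source-reversal trick; only the source side is reversed, the target sentence is left untouched:

```python
def reverse_source(sentence: str) -> str:
    """Reverse the source word order; the target is left as-is."""
    return " ".join(reversed(sentence.split()))

# The first source words end up closest to the first target words they align with.
print(reverse_source("the cat sat on the mat"))  # "mat the on sat cat the"
```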
- Used beam search for decoding
- Mentions perplexity - need to understand its implementation (a short sketch follows below)
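
Perplexity is the exponential of the average negative log-likelihood the model assigns to each target word; a minimal sketch (the numbers are made up):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(average negative log-likelihood per target word)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Per-word log-probabilities assigned by the model to one reference translation.
print(perplexity([math.log(0.25)] * 8))  # ~4: on average the model "chooses" among 4 words
```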
- Mentions the BLEU score - need to understand its implementation (a simplified sketch follows below)
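
A simplified, sentence-level sketch of BLEU (clipped n-gram precisions combined with a brevity penalty); real BLEU is computed at the corpus level and with smoothing, so treat this only as an illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(matches, 1e-9) / total))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0 for an exact match
```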
- An LSTM encoder-decoder with a limited vocabulary performs better than a phrase-based SMT system with an unlimited vocabulary
- One of the first papers on Neural Machine Translation (NMT)