The following are my learnings from the paper titled "Sequence to Sequence Learning with Neural Networks":

  • History
    • 1990s - Statistical Machine Translation (SMT) systems
      • Phrase-based MT
      • Syntax-based MT
      • Semantics-based MT
      • Translate pieces independently and put the Lego blocks together
    • 2007 - Google Translate was introduced
    • 2014 - First NMT paper
    • 2016 - Google Translate replaced SMT with NMT
  • The key idea behind the method is to use LSTMs to form an encoder-decoder architecture, so that an input of any length can be mapped into a fixed-dimensional vector. This fixed-dimensional vector is then used in the decoding phase to generate the translation (see the sketch below)
  • DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
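
A minimal sketch of this encoder-decoder idea, assuming PyTorch; the layer sizes, vocabulary sizes, and names below are illustrative toys, not the paper's configuration (which used 4 layers of 1000 cells):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hidden=64, layers=1):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Two separate LSTMs, as in the paper: one encodes, one decodes.
        self.encoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final (h, c) state is the fixed-dimensional vector v.
        _, v = self.encoder(self.src_emb(src))
        # The decoder starts from v and predicts each next target token.
        out, _ = self.decoder(self.tgt_emb(tgt), v)
        return self.proj(out)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=100, tgt_vocab=90)
src = torch.randint(0, 100, (2, 7))  # a batch of 2 source sentences, length 7
tgt = torch.randint(0, 90, (2, 5))   # the corresponding target prefixes
print(model(src, tgt).shape)         # torch.Size([2, 5, 90])
```
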
  • Model
    • Ensemble of 5 deep LSTMs, each with 384 million parameters
    • Two different LSTMs - one for the encoder and another for the decoder
    • Deep LSTMs work better than shallow LSTMs
    • The first LSTM encodes the input; the second LSTM decodes the output
    • Beam search in the decoder maintains a small number of partial hypotheses (see the sketch after the equation below)
    • Even a beam size of 1 performs well; a beam of size 2 provides most of the benefits of beam search

\begin{align} p(y_1, y_2, \ldots, y_{T^{\prime}} \mid x_1, \ldots, x_T) = \prod^{T^{\prime}}_{t=1} p(y_t \mid v, y_1, y_2, \ldots, y_{t-1}) \end{align}

where $v$ is the fixed-dimensional representation of the input sequence produced by the encoder, and the output length $T^{\prime}$ may differ from the input length $T$.
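
A minimal beam-search sketch over this factorization, in plain Python; `step` here is a hypothetical stand-in for the decoder, returning (token, log p(y_t | v, y_1, ..., y_{t-1})) pairs:

```python
import math

def beam_search(step, bos, eos, beam_size=2, max_len=10):
    beams = [(0.0, [bos])]  # each partial hypothesis: (log probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:           # finished hypotheses carry over as-is
                candidates.append((score, seq))
                continue
            for tok, logp in step(seq):  # extend by every possible next token
                candidates.append((score + logp, seq + [tok]))
        # Keep only the `beam_size` most probable partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0]

# Toy next-token distribution: always prefers token 1; token 0 acts as eos.
def toy_step(seq):
    return [(0, math.log(0.3)), (1, math.log(0.6)), (2, math.log(0.1))]

print(beam_search(toy_step, bos=-1, eos=0))  # the eos-terminated hypothesis wins
```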

  • Dataset details

    • WMT14 English to French task
    • 12 Million sentences
    • 348 Million French words
    • 304 Million English words
    • 160k English vocab size
    • 80k French vocab size
  • Training

    • LSTMs with 4 layers
    • 1000 cells in each layer
    • 1000-dimensional word embeddings
    • Input vocab = 160000
    • Output vocab = 80000
    • 384 million parameters in total
    • 64 million of these are pure recurrent connections (32M for the encoder LSTM and 32M for the decoder LSTM)
    • SGD with a fixed learning rate of 0.7, halved every half epoch after the first 5 epochs
    • Batch size of 128
    • Hard constraint on the norm of the gradient (gradient clipping): the gradient is rescaled whenever its norm exceeds a threshold (see the sketch after this list)
    • Sentences within a minibatch are sorted so that they are of roughly the same length, which speeds up training
    • About 10 days of training time on a machine with 8 GPUs
    • SGD without momentum
    • C++ implementation
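
A sketch of the hard gradient-norm constraint combined with the plain SGD update; the learning rate of 0.7 and the norm threshold of 5 are the values reported in the paper, while the NumPy framing and function name are mine:

```python
import numpy as np

def sgd_step_with_norm_constraint(params, grads, lr=0.7, max_norm=5.0):
    # Global L2 norm of the gradient across all parameter tensors.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the threshold, rescale the gradient to meet it.
    scale = max_norm / norm if norm > max_norm else 1.0
    # Plain SGD update: no momentum, as in the paper.
    return [p - lr * scale * g for p, g in zip(params, grads)]

params = [np.ones((3, 3)), np.ones(3)]
grads = [np.full((3, 3), 4.0), np.full(3, 4.0)]  # global norm ~ 13.9, so clipped
params = sgd_step_with_norm_constraint(params, grads)
```
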
  • Performance

    • BLEU score of 34.81 (ensemble) vs. the phrase-based SMT baseline's 33.30
    • The learned sentence representations cluster by meaning and are sensitive to word order, unlike bag-of-words representations
  • Reversing the order of the words in the source sentences (but not the target) greatly improved performance by introducing short-term dependencies between source and target words (sketched below)
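
A tiny illustration of the reversal trick: only the source side is reversed, so the first source word ends up close to the first target word it must help produce:

```python
def reverse_source(src_tokens, tgt_tokens):
    # "a b c" -> "c b a"; the target sentence is left unchanged.
    return src_tokens[::-1], tgt_tokens

print(reverse_source(["I", "like", "tea"], ["J'", "aime", "le", "thé"]))
# (['tea', 'like', 'I'], ["J'", 'aime', 'le', 'thé'])
```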

  • Used Beam Search

  • Mentions perplexity - need to understand how it is computed (see the sketch below)
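
A minimal sketch of perplexity, the exponential of the average negative log-likelihood per token; the probabilities below are made up for illustration:

```python
import math

def perplexity(token_probs):
    # token_probs: the model's probability for each correct token in a sequence.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.25, 0.25]))  # ~3.17; lower is better
```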

  • Mentions BLEU score - need to understand how it is computed (see the sketch below)
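
A simplified sentence-level BLEU sketch (clipped n-gram precision up to bigrams, with a brevity penalty); real BLEU goes up to 4-grams and is computed over a whole corpus:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if 0 in precisions:
        return 0.0
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
# ~0.707
```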

  • The LSTM encoder-decoder with a limited vocabulary performs better than the SMT system with an unlimited vocabulary

  • One of the first papers on Neural Machine Translation (NMT)