The following are my learnings from the paper titled "Sequence to Sequence Learning with Neural Networks":

  • History
    • 1990s - Statistical Machine Translation (SMT) systems
      • Phrase-based MT
      • Syntax-based MT
      • Semantics-based MT
      • Translate pieces independently and put the Lego blocks together
    • 2007 - Google Translate was introduced
    • 2014 - First NMT paper
    • 2016 - Google Translate replaced SMT with NMT
  • The key idea behind the method is to use LSTMs to form an encoder-decoder architecture, so that an input of any length can be mapped into a fixed-dimensional vector. This fixed-dimensional vector is then used in the decoding phase to generate the translation (see the sketch below)
  • DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality
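
A minimal sketch of this encoder-decoder idea, assuming PyTorch; the layer sizes, vocabulary sizes, and names below are illustrative toys, not the paper's configuration (which used 4 layers of 1000 cells):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hidden=64, layers=1):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Two separate LSTMs, as in the paper: one encodes, one decodes.
        self.encoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final (h, c) state is the fixed-dimensional vector v.
        _, v = self.encoder(self.src_emb(src))
        # The decoder starts from v and predicts each next target token.
        out, _ = self.decoder(self.tgt_emb(tgt), v)
        return self.proj(out)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=100, tgt_vocab=90)
src = torch.randint(0, 100, (2, 7))  # a batch of 2 source sentences, length 7
tgt = torch.randint(0, 90, (2, 5))   # the corresponding target prefixes
print(model(src, tgt).shape)         # torch.Size([2, 5, 90])
```
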
  • Model
    • Ensemble of 5 deep LSTMs, each with 384 million parameters
    • Two different LSTMs - one for the encoder and another for the decoder
    • Deep LSTMs work better than shallow LSTMs
    • The first LSTM encodes the input; the second LSTM decodes the output
    • Beam search in the decoder maintains a small number of partial hypotheses (see the sketch after the equation below)
    • Even a beam size of 1 performs well; a beam of size 2 provides most of the benefits of beam search

\begin{align} p(y_1, y_2, \ldots, y_{T^{\prime}} \mid x_1, \ldots, x_T) = \prod^{T^{\prime}}_{t=1} p(y_t \mid v, y_1, y_2, \ldots, y_{t-1}) \end{align}

where $v$ is the fixed-dimensional representation of the input sequence produced by the encoder, and the output length $T^{\prime}$ may differ from the input length $T$.
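
A minimal beam-search sketch over this factorization, in plain Python; `step` here is a hypothetical stand-in for the decoder, returning (token, log p(y_t | v, y_1, ..., y_{t-1})) pairs:

```python
import math

def beam_search(step, bos, eos, beam_size=2, max_len=10):
    beams = [(0.0, [bos])]  # each partial hypothesis: (log probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:           # finished hypotheses carry over as-is
                candidates.append((score, seq))
                continue
            for tok, logp in step(seq):  # extend by every possible next token
                candidates.append((score + logp, seq + [tok]))
        # Keep only the `beam_size` most probable partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0]

# Toy next-token distribution: always prefers token 1; token 0 acts as eos.
def toy_step(seq):
    return [(0, math.log(0.3)), (1, math.log(0.6)), (2, math.log(0.1))]

print(beam_search(toy_step, bos=-1, eos=0))  # the eos-terminated hypothesis wins
```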

  • Dataset details

    • WMT14 English to French task
    • 12 Million sentences
    • 348 Million French words
    • 304 Million English words
    • 160k English vocab size
    • 80k French vocab size
  • Training

    • LSTMs with 4 layers
    • 1000 cells in each layer
    • 1000-dimensional word embeddings
    • Input vocab = 160000
    • Output vocab = 80000
    • 384 million parameters in total
    • 64 million of these are pure recurrent connections (32M for the encoder LSTM and 32M for the decoder LSTM)
    • SGD with a fixed learning rate of 0.7, halved every half epoch after the first 5 epochs
    • Batch size of 128
    • Hard constraint on the norm of the gradient (gradient clipping): the gradient is rescaled whenever its norm exceeds a threshold (see the sketch after this list)
    • Sentences within a minibatch are sorted so that they are of roughly the same length, which speeds up training
    • About 10 days of training time on a machine with 8 GPUs
    • SGD without momentum
    • C++ implementation
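
A sketch of the hard gradient-norm constraint combined with the plain SGD update; the learning rate of 0.7 and the norm threshold of 5 are the values reported in the paper, while the NumPy framing and function name are mine:

```python
import numpy as np

def sgd_step_with_norm_constraint(params, grads, lr=0.7, max_norm=5.0):
    # Global L2 norm of the gradient across all parameter tensors.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the threshold, rescale the gradient to meet it.
    scale = max_norm / norm if norm > max_norm else 1.0
    # Plain SGD update: no momentum, as in the paper.
    return [p - lr * scale * g for p, g in zip(params, grads)]

params = [np.ones((3, 3)), np.ones(3)]
grads = [np.full((3, 3), 4.0), np.full(3, 4.0)]  # global norm ~ 13.9, so clipped
params = sgd_step_with_norm_constraint(params, grads)
```
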
  • Performance

    • BLEU score of 34.81 (ensemble) vs. the phrase-based SMT baseline's 33.30
    • The learned sentence representations cluster by meaning and are sensitive to word order, unlike bag-of-words representations
  • Reversing the order of the words in the source sentences (but not the target) greatly improved performance by introducing short-term dependencies between source and target words (sketched below)
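
A tiny illustration of the reversal trick: only the source side is reversed, so the first source word ends up close to the first target word it must help produce:

```python
def reverse_source(src_tokens, tgt_tokens):
    # "a b c" -> "c b a"; the target sentence is left unchanged.
    return src_tokens[::-1], tgt_tokens

print(reverse_source(["I", "like", "tea"], ["J'", "aime", "le", "thé"]))
# (['tea', 'like', 'I'], ["J'", 'aime', 'le', 'thé'])
```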

  • Used Beam Search

  • Mentions perplexity - need to understand how it is computed (see the sketch below)
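
A minimal sketch of perplexity, the exponential of the average negative log-likelihood per token; the probabilities below are made up for illustration:

```python
import math

def perplexity(token_probs):
    # token_probs: the model's probability for each correct token in a sequence.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.25, 0.25]))  # ~3.17; lower is better
```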

  • Mentions BLEU score - need to understand how it is computed (see the sketch below)
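
A simplified sentence-level BLEU sketch (clipped n-gram precision up to bigrams, with a brevity penalty); real BLEU goes up to 4-grams and is computed over a whole corpus:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if 0 in precisions:
        return 0.0
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
# ~0.707
```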

  • The LSTM encoder-decoder with a limited vocabulary performs better than the SMT system with an unlimited vocabulary

  • One of the first papers on Neural Machine Translation (NMT)