Attention is all you need
The following are my learnings from the paper titled "Attention Is All You Need":
- Using RNNs for language modeling has been particularly painful, as they take a long time to train and, being inherently sequential, cannot learn representations of all positions at once
- In the Transformer architecture, the number of operations required to relate signals from two arbitrary input or output positions is constant
- Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence
- The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution
- Learnt about the relationship between Induction, Deduction and Transduction
- Induction derives a function from the given data, i.e. it creates an approximating function
- Deduction derives the values of the given functions for points of interest
- Transduction derives the values of an unknown function for points of interest from the data
- An example of a transduction algorithm is k-nearest neighbors (see the short sketch below)
- A transducer in the context of NLP is defined as a model that outputs one time step for each input time step provided
- Many natural language processing (NLP) tasks can be viewed as transduction problems, that is, learning to convert one string into another. Machine translation is a prototypical example of transduction, and recent results indicate that deep RNNs can encode long source strings and produce coherent translations
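To make the transduction idea concrete, here is a minimal k-nearest-neighbors sketch in plain Python (the function, names and toy data are mine, for illustration only): the label for the query point is derived directly from the stored data, without ever fitting an explicit function.

```python
# k-NN as transduction: predict values for points of interest directly
# from the data, with no intermediate approximating function.
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = Counter(train_labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["A", "A", "B", "B"]
print(knn_predict(points, labels, (0.95, 1.0)))  # -> "B"
```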
- Model Details
- Encoder and Decoder
- Input is the word embedding plus a positional encoding
- Each encoder layer comprises a multi-head self-attention sub-layer and a position-wise feed-forward network, each with a residual connection followed by layer normalization
- Each decoder layer additionally comprises an encoder-decoder multi-head attention layer, whose keys and values come from the encoder and whose queries come from the decoder
- The encoder is a stack of 6 identical layers
- The decoder is likewise a stack of 6 identical layers
- An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors
- Scaled Dot product attention
\begin{align} \text{Attention}(Q,K,V) & = \text{softmax} \left( {QK^T \over \sqrt{d_k}} \right) V \end{align}
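To make the formula concrete, here is a minimal PyTorch sketch (the function and variable names are mine, not from the paper; the toy shapes are assumptions for illustration):

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # (..., seq_q, d_k) @ (..., d_k, seq_k) -> (..., seq_q, seq_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Block disallowed positions before the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention weights per query
    return weights @ V

Q = K = V = torch.randn(2, 5, 64)  # toy batch of 2 sequences, length 5
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 64])
```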
- MultiHead attention
\begin{align} \text{MultiHeadAttention}(Q,K,V) & = \text{Concat} (\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O \end{align}
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\) is the output of the \(i\)-th attention head, computed with learned projection matrices \(W_i^Q, W_i^K, W_i^V\), and \(W^O\) is a learned output projection
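A hedged PyTorch sketch of multi-head attention follows (class and dimension names are my own; d_model = 512 and 8 heads match the paper's base configuration). It reuses `scaled_dot_product_attention` from the previous sketch:

```python
# Multi-head attention: project Q, K, V into h subspaces with learned
# matrices, attend in each subspace, concatenate, and project with W^O.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One big projection per input is equivalent to h separate W_i matrices
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the W^O output projection

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        def split(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        out = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask
        )
        # Concatenate heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.w_o(out)
```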
- The encoder contains self-attention layers
- The decoder also contains self-attention layers, masked so that each position can attend only to earlier positions
- Positional encoding is done via sine and cosine functions
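For reference, the encodings from the paper are:

\begin{align} PE_{(pos, 2i)} & = \sin \left( pos / 10000^{2i/d_{\text{model}}} \right) \\ PE_{(pos, 2i+1)} & = \cos \left( pos / 10000^{2i/d_{\text{model}}} \right) \end{align}

where \(pos\) is the position and \(i\) is the dimension index, so each dimension corresponds to a sinusoid of a different wavelength.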
- Why do the authors use self-attention?
- Faster to train than RNNs
- Total computational complexity per layer is reduced
- Path length between input and output positions is shorter compared to RNNs (see the comparison table below)
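The paper's Table 1 makes these claims concrete (n = sequence length, d = representation dimension):

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
|---|---|---|---|
| Self-Attention | \(O(n^2 \cdot d)\) | \(O(1)\) | \(O(1)\) |
| Recurrent | \(O(n \cdot d^2)\) | \(O(n)\) | \(O(n)\) |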
- Training
- Performed on the WMT 2014 English-German dataset, which contains 4.5 million sentence pairs
- Each training batch contained approximately 25,000 source tokens and 25,000 target tokens
- Base model training time is 12 hours
- Big Model training time is 3.5 days
- Trained on 8 NVIDIA P100 GPUs
- Adam optimizer used, with \(\beta_1 = 0.9\), \(\beta_2 = 0.98\), \(\epsilon = 10^{-9}\) and a warmup-then-decay learning-rate schedule (see the sketch after this list)
- Three types of regularization done: residual dropout, dropout applied to the sums of the embeddings and positional encodings in both the encoder and decoder stacks, and label smoothing
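The learning-rate schedule from the paper is \(lrate = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})\), i.e. linear warmup followed by inverse-square-root decay. A small sketch, using the base model's values (d_model = 512, warmup_steps = 4000):

```python
# Warmup-then-decay learning rate from "Attention Is All You Need"
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 4000, 40000):
    print(s, transformer_lr(s))  # rate rises until step 4000, then decays
```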
- Results
- English to German - BLEU score of 28.4
- English to French - BLEU score of 41
Finally, after many weeks, I sat down and read through the entire paper. Of course, this would not have been possible without the input from the following sources:
- Jay Alammar's post
- Attention is all you need - paper walkthrough by Yannic Kilcher
- RASA - Attention paper walkthrough - 4 videos
- Attention paper walkthrough by Code Emporium
- LSTM is Dead - Long Live Transformers meetup talk
- ELMo + GPT-2 + Transformers - How NLP Cracked Transfer Learning - Jay Alammar
My immediate next step is to work through Aladdin's PyTorch videos on sequence-to-sequence and attention code. Hopefully, by understanding their code, I will get a good grasp of the transformer architecture.