The following are my learnings from the paper titled "Attention Is All You Need":
- Using RNNs for language modeling has been particularly painful: they take a long time to train and, because they process the sequence token by token, they cannot compute representations for all positions at once.
- In the Transformer architecture, the number of operations required to relate signals from two arbitrary input or output positions is constant.
- Self-attention is an attention mechanism that relates different positions of a single sentence in order to compute a representation of the sequence (see the sketch after this list).
- The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolutional networks.
- Learnt about the relationship between induction, deduction, and transduction. Induction derives the function from the given data, i.e. it infers a general rule from specific examples.
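
To make the self-attention point concrete, here is a minimal sketch of single-head scaled dot-product self-attention, assuming NumPy. The function name `self_attention`, the dimensions, and the random example inputs are illustrative assumptions, not taken from the paper or any released code.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q                       # queries, (seq_len, d_k)
    k = x @ w_k                       # keys,    (seq_len, d_k)
    v = x @ w_v                       # values,  (seq_len, d_k)
    d_k = q.shape[-1]

    # One matrix product relates every position to every other position,
    # so connecting any two positions costs a constant number of operations.
    scores = q @ k.T / np.sqrt(d_k)   # (seq_len, seq_len)

    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output position is a weighted mix of all value vectors.
    return weights @ v                # (seq_len, d_k)

# Usage: 4 tokens with 8-dimensional embeddings (illustrative sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

The scaling by the square root of d_k keeps the dot products from growing large and pushing the softmax into regions with tiny gradients, which is the motivation the paper gives for scaled (rather than plain) dot-product attention.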