BERT Neural Network Explained, Transformers
Contents
What I learned from CodeEmporium about BERT and Transformers
BERT Neural Network Explained
- LSTMs are slow (because of sequential processing) and not truly bidirectional
- Transformers are fast and truly bidirectional
  - Enable fast processing via parallel architectures
  - No recurrence, so the whole sequence is processed at once
  - Self-attention mechanism (see the sketch after this list)
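A minimal NumPy sketch of the scaled dot-product self-attention behind that last bullet; the matrices and dimensions here are toy values for illustration, not anything from an actual BERT checkpoint:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X          : (seq_len, d_model) token embeddings
    Wq, Wk, Wv : (d_model, d_k) projection matrices
    """
    Q = X @ Wq                              # queries
    K = X @ Wk                              # keys
    V = X @ Wv                              # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # every token attends to every token (both directions)
    weights = softmax(scores, axis=-1)      # attention weights, each row sums to 1
    return weights @ V                      # weighted sum of values

# Toy example: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4)
```

Because every token's score is computed against every other token at once, there is no left-to-right sweep, which is what makes this both parallel and truly bidirectional.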
- Stacking up encoders gives BERT
- Stacking up decoders gives GPT
- BERT has two training objectives in its pre-training phase
  - Next Sentence Prediction (does sentence 2 actually follow sentence 1?)
  - Masked Language Model (predict randomly hidden tokens from their context; see the sketch after this list)
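A rough sketch of the Masked Language Model idea; the helper name, token list and seed below are made up for illustration, and real BERT masking also sometimes keeps or randomly replaces the selected tokens instead of always substituting [MASK]:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15   # BERT hides roughly 15% of input tokens

def mask_tokens(tokens, prob=MASK_PROB, seed=42):
    """Replace a random subset of tokens with [MASK]; return the masked
    sequence plus the (position, original token) pairs the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            masked.append(MASK)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens)
print(masked)    # e.g. ['the', 'cat', '[MASK]', 'on', ...]
print(targets)   # the positions/words the model has to recover from both sides
```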
- Input
  - Word vector representations (token embeddings)
  - Positional encoding (see the sketch after this list)
  - Sentence (segment) index
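A note on positional encoding: BERT itself learns its position embeddings, but the fixed sinusoidal encoding from the original Transformer paper is the usual way to illustrate how position information gets added to the word vectors. A minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe                                      # added element-wise to the token embeddings

print(sinusoidal_positional_encoding(seq_len=128, d_model=768).shape)  # (128, 768)
```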
- Input Format - [CLS] + [Sentence 1] + [SEP] + [Sentence 2]
- Output Format - one hidden vector per input token:
  [CLS] output, \( T_1, T_2, \ldots, T_N \), [SEP] output, \( T_1^\prime, T_2^\prime, \ldots, T_M^\prime \) (see the tokenizer demo below)
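Assuming the HuggingFace transformers package and the bert-base-uncased checkpoint, here is roughly what those input and output formats look like in practice (the sentences are made up; shapes will differ for other checkpoints):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sentence pair -> [CLS] sentence 1 [SEP] sentence 2 [SEP]
enc = tokenizer("The cat sat on the mat.", "It was very tired.",
                return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'was', ..., '[SEP]']
print(enc["token_type_ids"][0])   # sentence index: 0s for sentence 1, 1s for sentence 2

outputs = model(**enc)
# One hidden vector per input token (the T_i / T_i' above)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
# Pooled [CLS] representation, used for next-sentence / classification heads
print(outputs.pooler_output.shape)       # (1, 768)
```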
NLP with Neural Networks and Transformers
- Embeddings with Language Models
- ELMo uses two layers of bidirectional LSTMs
- OpenAI GPT uses a stack of decoders and is very fast
- BERT is bidirectional encoding with the transformer architecture
  - Fast
  - Feed in the entire sentence at once
  - Learns context from both directions simultaneously
- XLNet is another improvement on BERT
- spaCy has many functions that can be used to integrate with BERT and HuggingFace models
- HuggingFace has a ton of libraries built around Transformers (see the fill-mask example below)
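For example, HuggingFace's pipeline API exposes BERT's masked-language-model head in a couple of lines (the model name and sentence are just illustrative):

```python
from transformers import pipeline

# Masked-language-model head on top of BERT
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
# Expect 'capital' (and similar words) near the top
```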
Transformer Neural Networks - EXPLAINED! (Attention is all you need)
- RNN model types
  - Many-to-many models
  - Many-to-one models
  - One-to-many models
- Transformer components
  - Encoder: input embedding + positional encoder + multi-headed attention layer + feed-forward layer
  - Decoder: output embedding + positional encoder + multi-headed attention layer + encoder-decoder attention + feed-forward layer
  - Query (Q), key (K) and value (V) vectors are used to compute attention
  - Layer normalization
- Transformer code in TensorFlow is available to play with (a minimal encoder-layer sketch follows below)
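As a rough picture of those encoder components, here is a minimal Keras sketch of one encoder block; the hyperparameters follow the paper's defaults, positional encoding and the decoder side are omitted, and this is not any particular released implementation:

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    """One encoder block: multi-head self-attention -> add & layer-norm
    -> position-wise feed-forward -> add & layer-norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x):
        # Self-attention: queries, keys and values all come from x
        attn_out = self.mha(query=x, value=x, key=x)
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        ffn_out = self.ffn(x)
        return self.norm2(x + ffn_out)     # residual connection + layer normalization

# Toy batch: 2 sequences of 10 tokens, already embedded with d_model = 512
x = tf.random.normal((2, 10, 512))
print(EncoderLayer()(x).shape)   # (2, 10, 512)
```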