The Transformer
Contents
The following are the learnings from the podcast:
- Transfer learning entails reusing existing models: use a model that comes from training on a different task
- Delivering value no longer requires custom feature engineering. Most of the recent successes have been in the field of computer vision
- If you do not have a lot of training data, then you can use a model that is already trained on a large image dataset (ImageNet).
- Once pre-training is done, additional layers can be overlaid on top of the pre-trained model (see the first sketch at the end of these notes)
- For language, the source and target datasets are the same kind of data - predicting the next word, neighboring words, etc.
- How do you determine whether the target is reasonable? Domain adaptation - the task remains the same, but the source and target domains are different
- Transfer from different sentiment categories
- Create a similarity metric and then check whether the tasks are similar
- If the tasks are similar, then one can apply transfer learning
- At a practitioner level, leverage the information from a different domain
- Decide whether you want to update the pre-trained weights or keep them frozen - to adapt a model to a lot of different tasks, freeze the model and then create several layers on top of it
- Is there an ImageNet moment around the corner?
- It is apparent that we have reached the ImageNet moment
- Either directly fine-tune the model or use the features of the pre-trained model (see the second sketch at the end of these notes)
- Plethora of pre-trained models
- XLNet
- Domain expertise - Word from pre-trained models
- Leverage the labels of the existing data
- Leverage the data
- Image recognition people vs NLP people
- Language is more challenging
- Deal with different languages
- Learn a lot more information
- Societal context - needs to be learned from the data
- Particular parts of the image
- How do different images relate to each other?
- Unlabeled data - we have the ability to get pre-trained information from it
- Hopefully rely on fewer labels
- Training cross-lingual models
- Universal embedding space
- One of the conceptually simpler approaches
- Map all the words into a common embedding space (see the third sketch at the end of these notes)
- Train the model on joint features
- Mapping is easier if there is a common language
- Scaling to distant languages is important
- A powerful source dataset is needed
- Difficulty of the target task matters - for reasonably good binary / multi-class classification, 50 to 200 examples are good enough
- Tasks that are more complex require more training examples
- OpenAI - used "TL;DR" as a prompt (see the last sketch at the end of these notes)
- Transfer learning is useful for many types of tasks
- Applying a pre-trained model to your own tasks is pretty easy
- Larger tasks - fine-tuning takes a couple of hours
- More methodological developments are needed
- Can generate datasets
- Improving models and improving techniques
- Long-term dependencies are still difficult to capture.
- BERT tries to solve this problem by using a large window for capturing the context
- Short-term contextual information
- Exploring other architectures + Exploring challenging datasets
- Near term - scaling up the training of large models for more performance; expect at least a couple of larger canonical models
- Making the models smaller - we want to get most of the benefits of enlarging models without having to deal with their size
- Lots of NLP datasets are available
- Developing new datasets is very useful for understanding the shortcomings of the models
Need to work on the basics of neural networks and then move on to transfer learning.
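Below are a few hedged code sketches for some of the techniques mentioned in the notes. The first illustrates the "use a model trained on a large image dataset (ImageNet), freeze it, and overlay additional layers" idea. It is a minimal sketch assuming PyTorch and a recent torchvision; the 5-class head, hidden size, and learning rate are hypothetical choices, not anything prescribed in the podcast.

```python
# Minimal sketch: reuse an ImageNet-pre-trained backbone, freeze it,
# and overlay new task-specific layers on top (assumes PyTorch + torchvision).
import torch
import torch.nn as nn
from torchvision import models

# Load a model that comes from training on a large image dataset (ImageNet).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Keep the pre-trained weights frozen so only the new layers are updated.
for param in backbone.parameters():
    param.requires_grad = False

# Overlay additional layers on top of the frozen backbone
# (a hypothetical 5-class target task).
num_target_classes = 5
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 256),
    nn.ReLU(),
    nn.Linear(256, num_target_classes),
)

# Only the new head's parameters are optimised.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```

Whether to keep the backbone frozen or also fine-tune it (the "update or keep frozen" decision above) only changes which parameters keep `requires_grad = True` and get passed to the optimiser.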
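The second sketch contrasts the two options mentioned for pre-trained NLP models: use the pre-trained model's features with a new head, or fine-tune the whole model. It assumes the Hugging Face `transformers` library, with `bert-base-uncased` standing in for the plethora of available pre-trained models; the toy sentence, label, and learning rate are only illustrative.

```python
# Minimal sketch: feature extraction vs. full fine-tuning of a pre-trained
# NLP model (assumes the Hugging Face `transformers` library).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # one of many available pre-trained models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Option 1: feature extraction - freeze the pre-trained encoder, train only the head.
for param in model.bert.parameters():
    param.requires_grad = False

# Option 2: full fine-tuning - skip the loop above and train everything,
# typically with a small learning rate so the pre-trained weights move only slightly.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)

# One training step on a single toy labelled example (binary sentiment).
batch = tokenizer(["the movie was great"], return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

With only tens to a couple of hundred labelled examples (the 50-200 figure above), the frozen-encoder option is often the safer starting point; fine-tuning everything usually pays off once a bit more data is available.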
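The third sketch shows one conceptually simple way to map words from different languages into a common embedding space: learn an orthogonal linear map between two pre-trained embedding spaces from a small dictionary of aligned word pairs (the Procrustes solution). This is a standard technique, not necessarily the exact approach discussed in the podcast, and the arrays below are random placeholders rather than real word vectors.

```python
# Conceptual sketch: align two monolingual embedding spaces into a common
# space with an orthogonal linear map (Procrustes), assuming NumPy.
import numpy as np

def learn_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Return W such that src_vecs @ W approximates tgt_vecs (rows are aligned pairs)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt  # orthogonal map minimising the alignment error

# Hypothetical toy data: 100 dictionary word pairs, 300-dimensional embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 300))  # source-language word vectors
tgt = rng.normal(size=(100, 300))  # vectors of their translations

W = learn_mapping(src, tgt)
mapped = src @ W  # source words now live in the target (common) embedding space
```

Such a mapping tends to work best between related languages; as the notes say, scaling it to distant languages is the harder and more important part.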
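Finally, the "TL;DR" point, as I understand it, refers to appending a "TL;DR:" cue to a document and letting a pre-trained language model continue, which nudges the continuation toward a summary. A hedged sketch using GPT-2 through the `transformers` text-generation pipeline; the article text is a placeholder.

```python
# Hedged sketch: use "TL;DR:" as a prompt so a pre-trained language model's
# continuation behaves like a summary (assumes the `transformers` library).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = "Transfer learning reuses models pre-trained on large datasets ..."  # placeholder text
prompt = article + "\nTL;DR:"

summary = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(summary)
```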