Data Skeptic - Word Embeddings Lower Bound
The following are my notes from a Data Skeptic podcast interview with Kevin Patel:
- The commonly used word embedding dimension of 300 is mostly chosen based on intuition
- When a telecom company wanted to analyze the sentiment of SMS messages, they were challenged by the heavy 300-dimensional representation of words. They wanted a lower-dimensional representation, more like a 10-dimensional space. This was a problem, as most available embedding datasets use at least 100- to 300-dimensional spaces
- To date, there has been no systematic scientific investigation into this hyperparameter choice
- Kevin Patel and his team investigated this hyperparameter on the Brown corpus and found that a dimension of 19 was enough to efficiently represent the word vectors in the Brown corpus
- The team borrowed concepts from algebraic topology. If there are four points that are all pairwise equidistant, one needs a three-dimensional representation of the points, in the form of a regular tetrahedron. Projecting them into a 2D space loses information (a small geometric sketch follows this list)
- "Bank" can have two meanings: money in the bank vs. a boat landing on a river bank. In traditional word2vec representations, "bank" is collapsed into a single vector based on the training corpus
- How many dimensions should be chosen for the representation? The team investigated the lower bound of this hyperparameter
- The team is now working on the upper bound
- Intrinsic evaluations
  - Word pair similarity task (labeled by humans); an evaluation sketch follows this list
  - Word analogy task
  - Word categorization into buckets, followed by verifying the clustering
- Extrinsic evaluations - look at a downstream application of the embedding layer and measure its performance (see the classifier sketch after this list)
- A regular tetrahedron requires 3 dimensions. If you move to 2 dimensions, the equidistance between the points has to be broken
- Pairwise equi-similar points - what is the minimum dimension of the vector space that can represent them?
- "New bounds for equiangular lines" - this paper was used as a guide for evaluating word embeddings
- Train embeddings on the Brown corpus for dimensions 1, 2, 3, …, 19, … (a training sketch follows this list)
- The dimension needed for the Brown corpus is about 19
- The team also plans to work on
  - Interpretable word embeddings - is there any interpretation of the individual dimensions?
  - How can one empirically validate the lower-bound result on bigger datasets?
  - How can persistent homology be used to compute the number of neurons needed in a neural network?
  - Word embeddings from machine translation
- Much of this work involves NP-complete problems, so the team is constrained to reporting results on toy datasets
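
To make the tetrahedron argument concrete, here is a minimal geometric sketch (my own illustration, not the team's code): four pairwise-equidistant points form a regular tetrahedron in 3D, and forcing them into 2D by dropping a coordinate necessarily breaks the equidistance.

```python
import numpy as np

# Vertices of a regular tetrahedron centred at the origin:
# four points that are all pairwise equidistant.
tetra = np.array([
    [ 1.0,  1.0,  1.0],
    [ 1.0, -1.0, -1.0],
    [-1.0,  1.0, -1.0],
    [-1.0, -1.0,  1.0],
])

def pairwise_distances(points):
    """All pairwise Euclidean distances between the rows of `points`."""
    n = len(points)
    return [np.linalg.norm(points[i] - points[j])
            for i in range(n) for j in range(i + 1, n)]

# In 3D every pair is the same distance apart (2*sqrt(2)).
print("3D:", np.round(pairwise_distances(tetra), 3))

# Dropping the third coordinate "projects" the points into 2D;
# the six distances are no longer all equal, so information is lost.
print("2D:", np.round(pairwise_distances(tetra[:, :2]), 3))
```

Generalizing this idea to sets of pairwise equi-similar word vectors is what motivates asking for the minimum embedding dimension.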
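The dimension sweep on the Brown corpus could look roughly like the sketch below. This is my own approximation using gensim and NLTK with guessed training settings (window, min_count, epochs), not the team's actual experimental setup.

```python
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec

nltk.download("brown", quiet=True)
sentences = brown.sents()  # tokenised sentences of the Brown corpus

models = {}
for dim in [1, 2, 3, 5, 10, 19, 50, 100]:  # sweep over embedding dimensions
    models[dim] = Word2Vec(sentences, vector_size=dim, window=5,
                           min_count=5, workers=4, epochs=5)
    print(f"dim={dim:3d}  nearest to 'money':",
          [w for w, _ in models[dim].wv.most_similar("money", topn=3)])
```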
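One of the intrinsic evaluations mentioned, the human-labelled word pair similarity task, can be approximated with gensim's bundled copy of the WordSim-353 dataset. Again a hedged sketch rather than the team's evaluation code; it reuses the `models` dict from the training sketch above.

```python
from gensim.test.utils import datapath

# Spearman correlation between human similarity ratings and cosine
# similarity in each trained space (`models` comes from the sweep above).
for dim, model in sorted(models.items()):
    pearson, spearman, oov = model.wv.evaluate_word_pairs(
        datapath("wordsim353.tsv"))
    print(f"dim={dim:3d}  spearman={spearman[0]:.3f}  oov={oov:.1f}%")
```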
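The extrinsic evaluation idea (and the telecom SMS sentiment use case) could be sketched as: average a message's word vectors, feed them to a simple classifier, and track the downstream score per embedding dimension. The tiny labelled sample below is purely hypothetical, since the SMS data from the episode is not public, and the 19-dim model is taken from the sweep above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `models` comes from the training sweep above; pick the 19-dim space.
wv = models[19].wv

def doc_vector(tokens, wv):
    """Average the vectors of in-vocabulary tokens (zeros if none are known)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# Hypothetical labelled messages (1 = positive sentiment, 0 = negative).
labelled = [
    ("the service was great".split(), 1),
    ("my bill was wrong again".split(), 0),
    ("really happy with the new plan".split(), 1),
    ("worst support i have ever had".split(), 0),
]

X = np.array([doc_vector(tokens, wv) for tokens, _ in labelled])
y = np.array([label for _, label in labelled])

clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))  # the downstream metric to track
```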
I found the podcast very interesting, as I have just started to understand the basic ideas of word embeddings and how they are used in NLP.