In this brief post, I would like to pen down my thoughts on two aspects of models: heuristics and non-interpretability.

Let’s look at the word embedding matrix. If you take a bunch of words and want to build a learning algorithm, the first task is to convert the text into a bunch of numbers. Two popular algorithms that have revolutionized the field of NLP are the Skip-gram method and the CBOW (continuous bag-of-words) method. Both involve learning a lower-dimensional representation of each word. The individual dimensions are not interpretable because they are not unique: rotating the embedding space changes every coordinate while leaving the similarities between words unchanged. That has not stopped people from developing fantastic applications. Suppose you are lost in a foreign country and want to ask someone the way to your destination. You flip open your phone, speak in your native language, your phone translates the sentence into the foreign language (as text or audio), and you use it to converse with strangers. The job gets done. Do you really care how the word embedding algo works? Not really. So we don’t need to get hung up on interpretability for all applications. In trading, for example, if the strategy makes money, you might not care too much about its interpretability.
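To make the point about non-unique dimensions concrete, here is a minimal sketch. It assumes the model scores a word/context pair through the dot product of their vectors, as Skip-gram with negative sampling does. The 2-D vectors, the angle, and the helper names `dot` and `rotate` are made up for illustration and are not real trained embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


# Made-up 2-D vectors for illustration, not real trained embeddings.
word_vec = (0.9, 0.1)
context_vec = (0.4, 0.7)

# Apply the same arbitrary rotation to both vectors.
angle = math.radians(37)
rotated_word = rotate(word_vec, angle)
rotated_context = rotate(context_vec, angle)

# The score the model sees is the same before and after the rotation
# (up to floating-point rounding)...
print(dot(word_vec, context_vec))
print(dot(rotated_word, rotated_context))

# ...even though every individual coordinate has changed.
print(word_vec, rotated_word)
```

Since both sets of coordinates fit the data equally well under this assumption, there is no reason to expect "dimension 1" to mean anything on its own.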

Also, think about the way these word embedding algos are implemented. There are a lot of heuristics that the developers chose to make the algos robust. One is the way you pick negative samples from the data: one can sample based on the empirical word frequency \( f(w_i) \), or based on the inverse of the word frequency. However, practitioners have found that it is better to sample in proportion to \( f(w_i)^{3/4} \). This is a heuristic with no theoretical justification, but it seems to work well.
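Here is a rough sketch of what the \( f(w_i)^{3/4} \) heuristic does to a sampling distribution. The toy corpus and the helper name `sampling_distribution` are my own assumptions for illustration; this is not the code of any particular word embedding library.

```python
import random
from collections import Counter


def sampling_distribution(counts, power=0.75):
    """Return {word: probability} with probability proportional to count ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}


# Toy corpus, made up for illustration.
tokens = ["the"] * 80 + ["cat"] * 15 + ["sat"] * 4 + ["zygote"] * 1
counts = Counter(tokens)

empirical = sampling_distribution(counts, power=1.0)   # raw word frequency
smoothed = sampling_distribution(counts, power=0.75)   # the 3/4 heuristic

for word in counts:
    print(f"{word:>7}  empirical={empirical[word]:.3f}  smoothed={smoothed[word]:.3f}")

# Draw a few words from the smoothed distribution (seeded for repeatability).
rng = random.Random(0)
words = list(smoothed)
draws = rng.choices(words, weights=[smoothed[w] for w in words], k=5)
print(draws)
```

Raising the counts to the power 3/4 takes some probability mass away from very frequent words such as "the" and gives it to rare words, so the result sits between sampling by raw frequency and sampling uniformly.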