Mapping Dialects with Twitter Data

The following are my learnings from the podcast:

- Bruno Gonçalves, who now works at JPMorgan Chase, holds a PhD from Emory University. He has done interesting work mining Twitter data for geography-based language patterns.
- Can one draw a map based on language patterns alone?
- The dataset is roughly 10 TB of Twitter data.
- Build a huge matrix pairing words with latitude and longitude, then do pattern matching on that words-by-geolocation matrix.
- Apply PCA followed by K-means clustering to the patterns in the high-dimensional matrix that combines word embeddings and geolocation (a minimal sketch of this step follows the list).
- Mobile phones have made marrying the two datasets, text and location, possible.
- The evolution of language across time can be studied the same way.
- Plenty of people are working on emojis in Twitter feeds.
- A lot can also be done with Reuters news data and NLP-based methods.
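A minimal sketch of that clustering step, using scikit-learn on a toy word-frequency matrix; the matrix, its dimensions, and the cluster count are illustrative assumptions rather than details from the podcast:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy stand-in for the real data: rows are geographic cells (lat/lon bins),
# columns are counts of words observed in tweets from that cell.
rng = np.random.default_rng(0)
word_counts = rng.poisson(lam=3.0, size=(500, 1000)).astype(float)

# Normalize each cell to relative word frequencies so heavily tweeted
# cells don't dominate the clustering.
freqs = word_counts / word_counts.sum(axis=1, keepdims=True)

# Reduce the high-dimensional word space before clustering.
pca = PCA(n_components=20)
reduced = pca.fit_transform(freqs)

# Cluster the cells; each cluster is a candidate "dialect region"
# that could be drawn back onto a map via the cells' coordinates.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
print(labels[:10])
```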

The Transformer

The following are my learnings from the podcast:

- The word "bank" has different meanings in different contexts: it could be a river bank or a financial institution.
- The Transformer is an encoder-decoder architecture that makes word embeddings more robust to context. It is a modern NLP technique.
- "Attention Is All You Need" is the paper that revolutionized this space. As the paper notes, the dominant sequence transduction models before it were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration; the Transformer replaces recurrence and convolution with attention (a sketch of the core attention operation follows the list).
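A minimal NumPy sketch of scaled dot-product attention, the core operation from "Attention Is All You Need"; the toy token representations are assumptions for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, per the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # context-weighted mix of the values

# Toy example: 4 tokens, 8-dimensional vectors. In a real Transformer,
# Q, K, V are learned linear projections of the token embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)  # (4, 8)
```

This is how a word like "bank" ends up context-sensitive: each token's output vector is a weighted mixture over all tokens in the sentence, so the surrounding words shape its representation.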

Named Entity Recognition

Kyle Polich discusses NER in this podcast. My learnings:

- What counts as an entity in an unstructured dataset? It depends on the context and the task the ML algorithm is trying to accomplish.
- spaCy is a Python package that can do NER (a short usage sketch follows the list).
- NER is used in chatbot and semantic search applications.
- Many NER packages are good but not great.
- Market research is another application: parse out the brands that were mentioned.
- Wikipedia has a lot of markup, which makes it easy to do NER there.
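A short spaCy usage sketch; the example sentence and the small English model `en_core_web_sm` are my choices, and the exact labels printed depend on the model version:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Kyle Polich recorded the podcast in Los Angeles for Data Skeptic.")

# Each recognized entity carries its text span and a type label.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output: "Kyle Polich PERSON", "Los Angeles GPE", ...
```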

The Death of a Language

Kyle interviews Zane and Leena about the Endangered Languages Project. My learnings:

- The project takes in 3.5 hours of audio content from an endangered language called "Ladin".
- It creates phonetic transcriptions from audio samples of human languages.
- The model has so far produced decent levels of vowel identification.
- The team is currently working on phoneme segmentation and larger consonant categories (a hypothetical sketch of this kind of pipeline follows the list).
- From the project blurb: "In this project, we are trying to speed up the process of language documentation by building a model that produces phonetic transcriptions from audio samples of human languages."
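Purely as a hypothetical illustration of this kind of pipeline, and not the project's actual model: extract MFCC acoustic features from audio and classify each frame into a phoneme category. The synthetic signal, frame labels, and classifier choice are all stand-ins.

```python
# Requires: pip install librosa scikit-learn
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic 2-second signal stands in for a real field recording.
sr = 16000
y = np.random.default_rng(0).normal(size=sr * 2).astype(np.float32)

# MFCCs summarize the spectral shape of each short audio frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

# Pretend we have phoneme labels per frame (e.g. from a small
# hand-transcribed subset); here they are random stand-ins.
labels = np.random.default_rng(1).integers(0, 5, size=len(mfcc))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(mfcc, labels)
print(clf.predict(mfcc[:10]))  # predicted phoneme class per frame
```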

Sequence to Sequence Models

Kyle Polich discusses sequence-to-sequence models. The following are the points from the podcast:

- Many ML approaches suffer from a fixed-input, fixed-output constraint. Natural language does not: summarizing a paper or translating between languages maps variable-length input to variable-length output.
- What a word means depends on its context.
- The algorithm learns an internal state representation of the input sequence.
- The encoder/decoder architecture has obvious promise for machine translation, and has been successfully applied this way (a minimal sketch follows the list).
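A minimal PyTorch sketch of the encoder/decoder idea: the encoder compresses a variable-length input into a fixed-size internal state, and the decoder unrolls an output of a different length from that state. Vocabulary size, hidden size, and token values are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        # Compress a variable-length source into a fixed-size hidden state,
        # the learned internal representation mentioned above.
        _, hidden = self.gru(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):
        # Unroll from the encoder's state; the output length need not
        # match the input length.
        output, hidden = self.gru(self.embed(tgt), hidden)
        return self.out(output), hidden

# Toy usage: a 5-token source sequence, a 3-token target prefix.
enc, dec = Encoder(100, 32), Decoder(100, 32)
src = torch.randint(0, 100, (1, 5))
tgt = torch.randint(0, 100, (1, 3))
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # (1, 3, 100): a word distribution per output step
```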