The following are my learnings from Jose's course on NLP.

  • Use of f-strings for string formatting
  • seek(0) moves the file pointer back to the start of the file
  • PyPDF2 is used to extract text from PDF files
  • spaCy offers a single best-of-breed algorithm for each task
  • NLTK - Released in 2001
  • spaCy - Released in 2015
  • CoreNLP - Java package
  • spaCy common tasks
    • Loading the language library
    • Building the pipeline object
    • Using Tokens
    • POS tagging
    • Understanding Token attributes
  • spaCy takes the text and creates a Doc object
  • Tagger, Parser and NER are the main components of the pipeline
  • Span is another data type: a slice of a Doc
  • One can extract the sentences
  • Tokens are basic building blocks of a sentence
  • The tokenizer splits text using prefix, suffix, infix and exception rules
  • displaCy is used for showing dependency trees
  • NLTK has the Porter Stemmer and the Snowball Stemmer
  • Lemmatization goes beyond word reduction: it considers the language's full vocabulary and looks at the context of a word
  • Phrase matching can be done with spaCy
  • Fine-grained and coarse-grained POS tags can be obtained with spaCy
  • One can set custom rules for sentence boundary detection
  • CountVectorizer converts text into a matrix of token counts
  • TfidfVectorizer converts text into a matrix of TF-IDF features
  • Word vectors are available via token.vector
  • Sentiment analysis using VADER
  • polarity_scores gives the negative, neutral, positive and compound sentiment scores of a sentence
  • One can apply a function to each sentence to get its compound sentiment score
  • Topic Modeling via LDA - You can do it via sklearn
  • Topic Modeling via NMF - You can do it via sklearn
  • Use LSTM for text generation
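
As a quick illustration of the first two notes, here is a minimal sketch of f-strings and seek(0). It is my own example, not code from the course: an in-memory io.StringIO buffer stands in for a real file, and the variable names and sample text are assumptions made up for the demo.

```python
import io

# Hypothetical sample data; an in-memory buffer stands in for a real file.
name = "spaCy"
year = 2015
line = f"{name} was released in {year}"  # f-string interpolation

buffer = io.StringIO(line + "\nNLTK was released in 2001\n")

first_read = buffer.read()   # reads everything; the pointer is now at the end
second_read = buffer.read()  # nothing left to read, so this is an empty string

buffer.seek(0)               # move the pointer back to the start
third_read = buffer.read()   # the full contents are readable again

print(repr(second_read))
print(third_read == first_read)
```

The same pattern should apply to a file opened with open(): once read() has consumed the file, a second read() returns an empty string until seek(0) resets the position.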

Here is the completion certificate

[Image: course completion certificate]