This post collects some lessons from deliberate practice with spaCy.

What can spaCy do?

  • spaCy can do shallow parsing (chunking). This entails grouping adjacent tokens into phrases based on their POS tags, such as noun phrases, verb phrases, and prepositional phrases. Noun phrases are exposed directly via doc.noun_chunks
  • Named Entity Recognition: This entails locating named entities and classifying them into pre-defined categories
    • Available packages to do NER
      • Stanford NER: Provides sequence models. You can train your own NER models on labeled data
      • spaCy: Comes with out-of-the-box NER tagging
      • NLTK: This involves going through three stages
        • Word Tokenization
        • POS tagging: Requires downloading corpora to do POS tagging and NER
        • Chunking: Shallow parsing that uses POS tagging and adds more structure to the sentence
  • Verb-phrase detection can be done via textacy
  • The dependency parse is available per token via token.dep_ (the relation label) and token.head (the parent token)
  • One can use regular expressions to match against the text of spaCy docs
  • One can quickly remove stop words, strip punctuation, and lemmatize via spaCy
  • token.tag_ gives the fine-grained POS tag
  • token.pos_ gives the coarse-grained POS tag
  • Word frequencies can be obtained by passing the tokens through a collections.Counter object
  • Lemmatization can be done via token.lemma_
  • spacy.lang.en.stop_words.STOP_WORDS gives the set of English stop words
  • nlp.vocab holds the vocabulary of the loaded pipeline (the lexemes it knows and their attributes)
  • Every token has a set of attributes and functions that are useful in NLP tasks
  • Sentence detection is automatic (iterate over doc.sents). One can also customize it with custom sentence-boundary rules
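
A few of the points above (stop words, punctuation removal, word frequencies, sentence detection) can be sketched in a handful of lines. This is a minimal sketch, assuming spaCy v3 is installed. To avoid depending on a downloaded model, it uses a blank English pipeline plus the rule-based sentencizer, so attributes that need a trained pipeline (pos_, tag_, dep_, lemma_, named entities) are deliberately left out. The sample text is made up for illustration.

```python
from collections import Counter

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Assumption: spaCy v3. A blank pipeline only has the tokenizer,
# so no trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")

# Rule-based sentence splitter; a trained pipeline would instead set
# sentence boundaries from its dependency parse.
nlp.add_pipe("sentencizer")

# Made-up sample text, for illustration only.
text = "spaCy is fast. spaCy is also easy to use, and the docs are good."
doc = nlp(text)

# Sentence detection: doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]

# Drop stop words and punctuation using per-token attributes.
content_words = [
    tok.lower_ for tok in doc if not tok.is_stop and not tok.is_punct
]

# Word frequencies via a Counter over the remaining tokens.
freq = Counter(content_words)

print(sentences)
print(freq.most_common(3))
print("is" in STOP_WORDS)
```

With a trained pipeline such as en_core_web_sm loaded through spacy.load, the same loop over doc could also read tok.pos_, tok.tag_, tok.dep_, and tok.lemma_.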