spaCy Deliberate Practice

This post collects some of what I learned from deliberate practice with spaCy.

What can spaCy do?

spaCy can do shallow parsing (chunking). This entails grouping adjacent tokens into phrases based on their POS tags, such as noun phrases, verb phrases, and prepositional phrases.
Named Entity Recognition (NER): This entails locating named entities and classifying them into predefined categories.
- Available packages for NER
  - Stanford NER: Provides sequence models. You can train your own NER models on labeled data
  - spaCy: Comes with out-of-the-box NER tagging
  - NLTK: This involves going through three stages
    - Word tokenization
    - POS tagging: The corpora needed for POS tagging and NER have to be downloaded first
    - Chunking: Shallow parsing that uses the POS tags to add more structure to the sentence
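
To make the spaCy option concrete, here is a minimal sketch of reading entities off a doc. It assumes spaCy v3. To avoid depending on a downloaded model, it swaps in a rule-based `entity_ruler` on a blank pipeline; the two patterns and the sample sentence are made up for illustration. With a trained pipeline, `doc.ents` would be filled by the statistical NER component instead, but it is read the same way.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model to download.
nlp = spacy.blank("en")

# Assumption for this sketch: a rule-based entity_ruler stands in for the
# statistical NER component that a trained pipeline would provide.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Apple is opening a new office in San Francisco.")

# Each entity is a span of tokens carrying a category label.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```
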
Verb-phrase detection can be done via textacy.
The dependency parse is available per token: token.dep_ gives the dependency label and token.head gives the parent token.
Regular expressions can be used to match text in spaCy docs.
One can quickly remove stop words, remove punctuation, and lemmatize tokens via spaCy.
token.tag_ gives the fine-grained POS tag.
token.pos_ gives the coarse-grained POS tag.
Word frequencies can be obtained by passing the token texts to a collections.Counter object.
Lemmatization can be done via token.lemma_
spacy.lang.en.stop_words.STOP_WORDS gives the set of English stop words.
nlp.vocab holds the pipeline's vocabulary: a lookup table of the lexemes (word types) and strings it knows about.
Every token has a set of attributes and methods that are useful in NLP tasks.
Sentence detection is automatic, and the sentences are available via doc.sents. One can also customize it to define custom sentence boundaries.
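
Several of the points above (stop-word and punctuation removal, word frequencies, and customized sentence detection) can be tried together in a short sketch. It assumes spaCy v3 and uses a blank English pipeline with the rule-based `sentencizer`, so no trained model is needed. Treating ";" as a sentence boundary is only an illustrative customization, and the sample text is made up. Attributes such as `pos_`, `tag_`, `dep_`, and `lemma_` are left out here because they need a trained pipeline to be filled in.

```python
from collections import Counter

import spacy

# Blank English pipeline (tokenizer and lexical attributes only), so no
# trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")

# Rule-based sentence splitter; adding ";" to the boundary characters is
# an illustrative customization, not spaCy's default behaviour.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp("The cat sat on the mat; the cat slept. The dog barked at the cat!")

# Keep lowercased word forms, dropping stop words and punctuation.
words = [t.lower_ for t in doc if not t.is_stop and not t.is_punct]

# Word frequencies via a plain Counter.
freq = Counter(words)

# Sentences come from doc.sents once a sentence-boundary component has run.
sentences = [s.text for s in doc.sents]

print(freq.most_common(3))
print(sentences)
```

With a trained pipeline, the same filtering loop could collect `t.lemma_` instead of `t.lower_` to count lemmas rather than surface forms.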

Contents