Part-of-Speech Tagging#

In this lesson, we’re going to learn about the textual analysis methods part-of-speech tagging and keyword extraction. These methods will help us computationally parse sentences and better understand words in context.

We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for other languages:

[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

-Claire Cain Miller, “Ada Lovelace,” NYT Overlooked Obituaries

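[Figure: displaCy dependency diagram of the sample sentence. Part-of-speech labels: She (PRON), has (AUX), thrown (VERB), her (DET), magical (ADJ), spell (NOUN), around (ADP), the (DET), most (ADV), abstract (ADJ), of (ADP), Sciences (PROPN). Arc labels, left to right: nsubj, aux, poss, amod, dobj, prep, det, advmod, pobj, prep, pobj.]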

Why is Part-of-Speech Tagging Useful?#

I don’t mean to go all Language Nerd on you, but parts of speech are important. Even if they seem kind of boring. Parts of speech are the grammatical units of language — such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.

By computationally identifying parts of speech, we can start computationally exploring syntax (the relationship between words) rather than only focusing on words in isolation, as we did with tf-idf. Though parts of speech may seem pedantic, they help computers (and us) take a crack at that ever-elusive abstract noun: meaning.

spaCy and Natural Language Processing (NLP)#

To computationally identify parts of speech, we’re going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more.

Install spaCy#

To use spaCy, we first need to install the library.

!pip install -U spacy

Import Libraries#

Then we’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 400
pd.options.display.max_colwidth = 400

We’re also going to import the Counter class (from Python’s collections module) for counting nouns, verbs, adjectives, etc., and the pandas library for organizing and displaying data (we’re also changing the pandas default display settings for the maximum number of rows and the maximum column width).
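As a quick illustration of what Counter does, here is a minimal sketch that uses a short, made-up list of words (the list `words` is hypothetical and is not drawn from any text in this lesson). Counter tallies how many times each item appears, and .most_common() returns the items sorted from most to least frequent.

```python
from collections import Counter

# A made-up list of words, standing in for words we might pull out of a text
words = ["old", "green", "old", "first", "green", "old"]

# Count how many times each unique word appears
tally = Counter(words)

# Sort the words from most to least frequent
top_words = tally.most_common()
print(top_words)
```

We will use this same count-then-sort pattern on real spaCy tokens later in the lesson.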

Download Language Model#

Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

!python -m spacy download en_core_web_sm
Requirement already satisfied: en_core_web_sm==2.1.0 from in /Users/melaniewalsh/anaconda3/lib/python3.7/site-packages (2.1.0)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.

spaCy offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.

Load Language Model#

Once the model is downloaded, we need to load it with spacy.load() and assign it to the variable nlp.

nlp = spacy.load('en_core_web_sm')

Create a Processed spaCy Document#

Whenever we use spaCy, our first step will be to create a processed spaCy document with the loaded NLP model nlp(). Most of the heavy NLP lifting is done in this line of code. After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

To test out spaCy’s part-of-speech tagging, we’ll begin by processing a sample sentence from Ada Lovelace’s obituary:

[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

This sentence makes for an interesting example because it is syntactically complex and because it contains difficult, ambiguous words such as “spell,” “abstract,” and “force.”

sample = """She “has thrown her magical spell around the most abstract of Sciences."""
document = nlp(sample)

spaCy Part-of-Speech Tagging#






POS      Description                 Examples
ADJ      adjective                   big, old, green, incomprehensible, first
ADP      adposition                  in, to, during
ADV      adverb                      very, tomorrow, down, where, there
AUX      auxiliary                   is, has (done), will (do), should (do)
CONJ     conjunction                 and, or, but
CCONJ    coordinating conjunction    and, or, but
DET      determiner                  a, an, the
INTJ     interjection                psst, ouch, bravo, hello
NOUN     noun                        girl, cat, tree, air, beauty
NUM      numeral                     1, 2017, one, seventy-seven, IV, MMXIV
PART     particle                    ’s, not
PRON     pronoun                     I, you, he, she, myself, themselves, somebody
PROPN    proper noun                 Mary, John, London, NATO, HBO
PUNCT    punctuation                 ., (, ), ?
SCONJ    subordinating conjunction   if, while, that
SYM      symbol                      $, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB     verb                        run, runs, running, eat, ate, eating






Above is a POS chart taken from spaCy’s website, which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy’s POS tagging in action, we can use the spaCy module displacy on our sample document with the style= parameter set to “dep” (short for dependency parsing):

#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(document, style="dep", options=options)
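[Output: displaCy dependency diagram of the sample sentence. Part-of-speech labels: She (PRON), has (AUX), thrown (VERB), her (PRON), magical (ADJ), spell (NOUN), around (ADP), the (DET), most (ADV), abstract (ADJ), of (ADP), Sciences (PROPN). Arc labels, left to right: nsubj, aux, poss, amod, dobj, prep, det, amod, pobj, prep, pobj.]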

As you can see, spaCy has correctly identified that “spell” is a noun in our sample sentence (“force” appears only in the later part of the quotation, which we left out of the sample):

for token in document:
    if token.pos_ == "NOUN":
        print(token, token.pos_)
spell NOUN

But if we look at the same words in a different context — in a sentence that I made up — spaCy can identify when these words have changed grammatical roles and meanings.

You shouldn’t force someone to learn how to spell Babbage. They just need practice. You can’t abstract it.

document = nlp("You shouldn't force someone to learn how to spell Babbage. They just need practice. You can't abstract it.")
for token in document:
    if token.pos_ == "VERB":
        print(token, token.pos_)
force VERB
learn VERB
spell VERB
need VERB
abstract VERB

Where previously spaCy had identified “spell” as a noun, here spaCy correctly identifies the words “force,” “spell,” and “abstract” as verbs.

Get Part-Of-Speech Tags#

To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the .pos_ attribute for each token. We can also get each token’s syntactic dependency label, which describes its grammatical relationship to other words in the sentence, with the attribute .dep_.

for token in document:
    print(token.text, token.pos_, token.dep_)
You PRON nsubj
should AUX aux
n't PART neg
someone PRON dobj
to PART aux
learn VERB xcomp
how SCONJ advmod
to PART aux
spell VERB xcomp
Babbage NOUN dobj
. PUNCT punct
They PRON nsubj
just ADV advmod
practice NOUN dobj
. PUNCT punct
You PRON nsubj
ca AUX aux
n't PART neg
abstract VERB ROOT
it PRON dobj
. PUNCT punct
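The abbreviated labels can be hard to read at a glance. As a minimal sketch, the code below pairs a few of the labels printed above with their standard Universal POS descriptions. The dictionary `pos_descriptions`, the list `tagged_tokens`, and the function `describe()` are hypothetical names introduced for illustration; the (text, POS) pairs are copied by hand from the output above rather than produced by spaCy. (spaCy also provides a built-in helper, spacy.explain(), which returns a description for a given label.)

```python
# Plain-English descriptions for the POS labels that appear in the output above
pos_descriptions = {
    "PRON": "pronoun",
    "AUX": "auxiliary",
    "PART": "particle",
    "VERB": "verb",
    "NOUN": "noun",
    "PUNCT": "punctuation",
}

# A few (text, POS) pairs copied by hand from the output above
tagged_tokens = [("You", "PRON"), ("should", "AUX"), ("n't", "PART"),
                 ("learn", "VERB"), ("practice", "NOUN"), (".", "PUNCT")]

def describe(tagged_tokens):
    """Pair each token with a plain-English description of its POS label."""
    return [(text, pos, pos_descriptions.get(pos, "unknown"))
            for text, pos in tagged_tokens]

# Print the tokens, labels, and descriptions in aligned columns
for text, pos, description in describe(tagged_tokens):
    print(f"{text:10}{pos:8}{description}")
```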

Practicing with Dracula#

filepath = "../texts/literature/Dracula_Bram-Stoker.txt"
#Read the whole novel into a string, then process it with spaCy
with open(filepath, encoding="utf-8") as file:
    document = nlp(file.read())

Get Adjectives#






POS    Description    Examples
ADJ    adjective      big, old, green, incomprehensible, first

To extract and count the adjectives in Dracula, we will follow the same model as above, except we’ll add an if statement that will pull out words only if their POS label matches “ADJ.”

Python Review

While we demonstrate how to extract parts of speech in the sections below, we’re also going to reinforce some integral Python skills. Notice how we use for loops and if statements to .append() specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.

Here we make a list of the adjectives identified in Dracula:

adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)

Then we count the unique adjectives in this list with the Counter class and display them from most to least frequent:

adjs_tally = Counter(adjs)
adjs_tally.most_common()
[('good', 198),
 ('old', 188),
 ('other', 185),
 ('own', 184),
 ('more', 178),
 ('great', 173),
 ('poor', 171),
 ('little', 164),
 ('dear', 151),
 ('much', 148),
 ('such', 129),
 ('last', 115),
 ('same', 110),
 ('white', 103),
 ('many', 100),
 ('terrible', 99),
 ('full', 97),
 ('long', 90),
 ('few', 86),
 ('strange', 85),
 ('first', 78),
 ('new', 73),
 ('ready', 71),
 ('dead', 69),
 ('red', 67),
 ('whole', 66),
 ('open', 66),
 ('sweet', 65),
 ('dark', 60),
 ('strong', 59),
 ('very', 57),
 ('true', 54),
 ('heavy', 53),
 ('young', 53),
 ('quick', 48),
 ('able', 47),
 ('happy', 47),
 ('right', 47),
 ('asleep', 47),
 ('big', 44),
 ('small', 43),
 ('sure', 43),
 ('better', 43),
 ('best', 41),
 ('cold', 41),
 ('wild', 41),
 ('close', 41),
 ('free', 41),
 ('late', 40),
 ('certain', 40),
 ('present', 40),
 ('afraid', 39),
 ('high', 38),
 ('quiet', 37),
 ('pale', 36),
 ('silent', 35),
 ('glad', 35),
 ('usual', 33),
 ('sad', 33),
 ('possible', 32),
 ('bad', 32),
 ('least', 31),
 ('beautiful', 31),
 ('low', 31),
 ('awful', 31),
 ('thin', 31),
 ('hard', 30),
 ('brave', 30),
 ('alone', 29),
 ('mad', 29),
 ('next', 28),
 ('deep', 28),
 ('anxious', 28),
 ('wonderful', 27),
 ('empty', 27),
 ('electronic', 27),
 ('black', 26),
 ('sharp', 26),
 ('half', 26),
 ('awake', 25),
 ('sudden', 25),
 ('horrible', 25),
 ('necessary', 25),
 ('fair', 25),
 ('safe', 25),
 ('Good', 25),
 ('grim', 24),
 ('bright', 24),
 ('fresh', 24),
 ('tired', 24),
 ('wide', 23),
 ('different', 23),
 ('only', 23),
 ('common', 22),
 ('satisfied', 22),
 ('noble', 21),
 ('short', 21),
 ('enough', 21),
 ('dreadful', 21),
 ('bitter', 21),
 ('weak', 21),
 ('odd', 20),
 ('Poor', 20),
 ('well', 19),
 ('round', 18),
 ('most', 18),
 ('evident', 18),
 ('worse', 18),
 ('ill', 18),
 ('real', 18),
 ('Dead', 18),
 ('simple', 17),
 ('less', 17),
 ('fine', 17),
 ('careful', 17),
 ('earnest', 17),
 ('wrong', 17),
 ('evil', 16),
 ('several', 16),
 ('past', 16),
 ('clever', 16),
 ('hypnotic', 16),
 ('latter', 15),
 ('sleepy', 15),
 ('fierce', 15),
 ('greater', 15),
 ('complete', 15),
 ('second', 15),
 ('angry', 15),
 ('cheerful', 15),
 ('blind', 15),
 ('excellent', 14),
 ('general', 14),
 ('tall', 14),
 ('Last', 14),
 ('soft', 14),
 ('sacred', 14),
 ('worth', 14),
 ('nice', 14),
 ('easy', 13),
 ('green', 13),
 ('blue', 13),
 ('large', 13),
 ('various', 13),
 ('foul', 13),
 ('calm', 13),
 ('fearful', 13),
 ('dearest', 13),
 ('surprised', 13),
 ('alive', 13),
 ('further', 12),
 ('clear', 12),
 ('important', 12),
 ('considerable', 12),
 ('early', 12),
 ('broken', 12),
 ('thick', 12),
 ('impossible', 12),
 ('sane', 12),
 ('nervous', 12),
 ('unhappy', 12),
 ('kind', 12),
 ('personal', 12),
 ('stern', 12),
 ('grateful', 12),
 ('useful', 11),
 ('unknown', 11),
 ('lower', 11),
 ('solemn', 11),
 ('faint', 11),
 ('conscious', 11),
 ('sufficient', 11),
 ('hot', 11),
 ('dangerous', 11),
 ('precious', 11),
 ('lovely', 11),
 ('human', 11),
 ('-', 11),
 ('violent', 11),
 ('grave', 11),
 ('selfish', 11),
 ('special', 11),
 ('public', 11),
 ('rough', 11),
 ('available', 10),
 ('interesting', 10),
 ('endless', 10),
 ('sick', 10),
 ('painful', 10),
 ('main', 10),
 ('broad', 10),
 ('willing', 10),
 ('uneasy', 10),
 ('near', 10),
 ('wise', 10),
 ('mere', 10),
 ('voluptuous', 10),
 ('deadly', 10),
 ('serious', 10),
 ('physical', 10),
 ('grey', 10),
 ('stronger', 10),
 ('tiny', 10),
 ('particular', 10),
 ('exact', 9),
 ('comfortable', 9),
 ('steep', 9),
 ('natural', 9),
 ('double', 9),
 ('mysterious', 9),
 ('excited', 9),
 ('distant', 9),
 ('foreign', 9),
 ('mighty', 9),
 ('lonely', 9),
 ('curious', 9),
 ('extraordinary', 9),
 ('narrow', 9),
 ('dim', 9),
 ('sorry', 9),
 ('previous', 9),
 ('desperate', 9),
 ('light', 9),
 ('unconscious', 9),
 ('worst', 9),
 ('difficult', 9),
 ('active', 9),
 ('tight', 8),
 ('ordinary', 8),
 ('like', 8),
 ('peculiar', 8),
 ('loud', 8),
 ('faithful', 8),
 ('warm', 8),
 ('vague', 8),
 ('helpless', 8),
 ('local', 8),
 ('former', 8),
 ('due', 8),
 ('similar', 8),
 ('resolute', 8),
 ('secret', 8),
 ('miserable', 8),
 ('funny', 8),
 ('absolute', 8),
 ('busy', 8),
 ('strait', 8),
 ('mortal', 8),
 ('spiritual', 8),
 ('eager', 8),
 ('left', 8),
 ('rare', 8),
 ('equal', 8),
 ('bent', 7),
 ('straight', 7),
 ('sharper', 7),
 ('far', 7),
 ('successful', 7),
 ('patient', 7),
 ('single', 7),
 ('courteous', 7),
 ('interested', 7),
 ('lunatic', 7),
 ('proud', 7),
 ('later', 7),
 ('about', 7),
 ('gentle', 7),
 ('More', 7),
 ('churchyard', 7),
 ('beloved', 7),
 ('regular', 7),
 ('accurate', 7),
 ('armed', 7),
 ('weaker', 7),
 ('rusty', 7),
 ('utmost', 7),
 ('mental', 7),
 ('infinite', 7),
 ('dry', 7),
 ('stertorous', 7),
 ('holy', 7),
 ('correct', 6),
 ('pretty', 6),
 ('frightened', 6),
 ('pleasant', 6),
 ('recent', 6),
 ('agonised', 6),
 ('fright', 6),
 ('hollow', 6),
 ('hearty', 6),
 ('English', 6),
 ('sized', 6),
 ('private', 6),
 ('visible', 6),
 ('weary', 6),
 ('bare', 6),
 ('legal', 6),
 ('yellow', 6),
 ('dusty', 6),
 ('upset', 6),
 ('fatal', 6),
 ('wooden', 6),
 ('diabolical', 6),
 ('intense', 6),
 ('favourite', 6),
 ('startled', 6),
 ('intellectual', 6),
 ('suspicious', 6),
 ('ghastly', 6),
 ('immediate', 6),
 ('picturesque', 5),
 ('proper', 5),
 ('hysterical', 5),
 ('ghostly', 5),
 ('jagged', 5),
 ('hearted', 5),
 ('blank', 5),
 ('cool', 5),
 ('swift', 5),
 ('accustomed', 5),
 ('vast', 5),
 ('massive', 5),
 ('likely', 5),
 ('clean', 5),
 ('prepared', 5),
 ('bushy', 5),
 ('cruel', 5),
 ('immense', 5),
 ('True', 5),
 ('startling', 5),
 ('instant', 5),
 ('silver', 5),
 ('castle', 5),
 ('bloody', 5),
 ('third', 5),
 ('brilliant', 5),
 ('golden', 5),
 ('cunning', 5),
 ('sensitive', 5),
 ('Same', 5),
 ('dizzy', 5),
 ('vain', 5),
 ('subtle', 5),
 ('despairing', 5),
 ('earthly', 5),
 ('homicidal', 5),
 ('vital', 5),
 ('concerned', 5),
 ('unusual', 5),
 ('restless', 5),
 ('additional', 5),
 ('loose', 5),
 ('humble', 5),
 ('younger', 5),
 ('pure', 5),
 ('apparent', 5),
 ('hush', 5),
 ('garlic', 5),
 ('pleased', 5),
 ('slow', 5),
 ('original', 5),
 ('individual', 5),
 ('thicker', 5),
 ('known', 4),
 ('dirty', 4),
 ('harmless', 4),
 ('fourth', 4),
 ('sorrowful', 4),
 ('snowy', 4),
 ('delicate', 4),
 ('weird', 4),
 ('keen', 4),
 ('nearer', 4),
 ('courtly', 4),
 ('normal', 4),
 ('charming', 4),
 ('constant', 4),
 ('pointed', 4),
 ('whirl', 4),
 ('content', 4),
 ('ancient', 4),
 ('gay', 4),
 ('mistaken', 4),
 ('locked', 4),
 ('definite', 4),
 ('warlike', 4),
 ('smooth', 4),
 ('sheer', 4),
 ('south', 4),
 ('wicked', 4),
 ('touching', 4),
 ('manifest', 4),
 ('rocky', 4),
 ('safer', 4),
 ('laden', 4),
 ('harsh', 4),
 ('lethal', 4),
 ('useless', 4),
 ('doubtless', 4),
 ('idle', 4),
 ('nearest', 4),
 ('married', 4),
 ('haired', 4),
 ('ashamed', 4),
 ('perfect', 4),
 ('Little', 4),
 ('unselfish', 4),
 ('sceptical', 4),
 ('Sacred', 4),
 ('ye', 4),
 ('firm', 4),
 ('zoöphagous', 4),
 ('anæmic', 4),
 ('sore', 4),
 ('greatest', 4),
 ('entire', 4),
 ('needless', 4),
 ('Russian', 4),
 ('steady', 4),
 ('ignorant', 4),
 ('agonising', 4),
 ('rosy', 4),
 ('strict', 4),
 ('deserted', 4),
 ('padded', 4),
 ('troubled', 4),
 ('unexpected', 4),
 ('reasonable', 4),
 ('slight', 4),
 ('lethargic', 4),
 ('desolate', 4),
 ('deeper', 4),
 ('bewildered', 4),
 ('ole', 4),
 ('tender', 4),
 ('intent', 4),
 ('feeble', 4),
 ('whiter', 4),
 ('unable', 4),
 ('amazed', 4),
 ('harrowing', 4),
 ('closer', 4),
 ('rude', 4),
 ('leaden', 4),
 ('unclean', 4),
 ('hungry', 4),
 ('official', 4),
 ('brute', 4),
 ('highest', 4),
 ('worried', 4),
 ('hellish', 4),
 ('stable', 4),
 ('Turkish', 3),
 ('thirsty', 3),
 ('German', 3),
 ('elderly', 3),
 ('rich', 3),
 ('lofty', 3),
 ('glorious', 3),
 ('purple', 3),
 ('brown', 3),
 ('outer', 3),
 ('eastern', 3),
 ('oppressive', 3),
 ('sandy', 3),
 ('colder', 3),
 ('uncanny', 3),
 ('occasional', 3),
 ('remarkable', 3),
 ('Strange', 3),
 ('needful', 3),
 ('suitable', 3),
 ('solid', 3),
 ('Hungarian', 3),
 ('sleepless', 3),
 ('practical', 3),
 ('formal', 3),
 ('gruesome', 3),
 ('freer', 3),
 ('unlocked', 3),
 ('nineteenth', 3),
 ('Great', 3),
 ('dreamy', 3),
 ('delightful', 3),
 ('repulsive', 3),
 ('shadowy', 3),
 ('intact', 3),
 ('uncertain', 3),
 ('key', 3),
 ('square', 3),
 ('post', 3),
 ('earthy', 3),
 ('powerful', 3),
 ('happier', 3),
 ('redder', 3),
 ('bloated', 3),
 ('quickest', 3),
 ('inclined', 3),
 ('worthy', 3),
 ('balanced', 3),
 ('drunk', 3),
 ('loving', 3),
 ('fit', 3),
 ('nigh', 3),
 ('wakeful', 3),
 ('devouring', 3),
 ('manifold', 3),
 ('wet', 3),
 ('dank', 3),
 ('inner', 3),
 ('foolish', 3),
 ('eyed', 3),
 ('fond', 3),
 ('haggard', 3),
 ('languid', 3),
 ('respectful', 3),
 ('religious', 3),
 ('jealous', 3),
 ('flit', 3),
 ('medical', 3),
 ('advanced', 3),
 ('truest', 3),
 ('appalling', 3),
 ('dull', 3),
 ('sullen', 3),
 ('gravely:--', 3),
 ('thankful', 3),
 ('narcotic', 3),
 ('probable', 3),
 ('nauseous', 3),
 ('sovereign', 3),
 ('bold', 3),
 ('stupid', 3),
 ('frightful', 3),
 ('frantic', 3),
 ('frequent', 3),
 ('decent', 3),
 ('aware', 3),
 ('monstrous', 3),
 ('dreary', 3),
 ('bowed', 3),
 ('negative', 3),
 ('daily', 3),
 ('unholy', 3),
 ('smaller', 3),
 ('actual', 3),
 ('limited', 3),
 ('electric', 3),
 ('huge', 3),
 ('future', 3),
 ('powerless', 3),
 ('gallant', 3),
 ('final', 3),
 ('higher', 3),
 ('Unclean', 3),
 ('orderly', 3),
 ('plain', 3),
 ('wily', 3),
 ('derivative', 3),
 ('applicable', 3),
 ('defective', 3),
 ('western', 2),
 ('splendid', 2),
 ('extreme', 2),
 ('Carpathian', 2),
 ('continuous', 2),
 ('subject', 2),
 ('clumsy', 2),
 ('strangest', 2),
 ('stormy', 2),
 ('fashioned', 2),
 ('coloured', 2),
 ('reticent', 2),
 ('ridiculous', 2),
 ('imperative', 2),
 ('idolatrous', 2),
 ('mixed', 2),
 ('queer', 2),
 ('sympathetic', 2),
 ('pine', 2),
 ('varied', 2),
 ('kindly', 2),
 ('exciting', 2),
 ('slightest', 2),
 ('stranger', 2),
 ('prodigious', 2),
 ('minded', 2),
 ('momentary', 2),
 ('imperious', 2),
 ('moonlit', 2),
 ('bigger', 2),
 ('shaven', 2),
 ('Welcome', 2),
 ('octagonal', 2),
 ('welcome', 2),
 ('top', 2),
 ('arched', 2),
 ('looking', 2),
 ('absent', 2),
 ('opposite', 2),
 ('political', 2),
 ('aged', 2),
 ('friendly', 2),
 ('myriad', 2),
 ('closed', 2),
 ('malignant', 2),
 ('prosaic', 2),
 ('veritable', 2),
 ('fascinating', 2),
 ('fell', 2),
 ('unnatural', 2),
 ('sidelong', 2),
 ('thorough', 2),
 ('hateful', 2),
 ('merciful', 2),
 ('intolerable', 2),
 ('super', 2),
 ('languorous', 2),
 ('slender', 2),
 ('vile', 2),
 ('villainy', 2),
 ('piteous', 2),
 ('ruthless', 2),
 ('tiniest', 2),
 ('aërial', 2),
 ('phantom', 2),
 ('extravagant', 2),
 ('naked', 2),
 ('British', 2),
 ('stately', 2),
 ('out:--', 2),
 ('swollen', 2),
 ('merry', 2),
 ('strained', 2),
 ('cursed', 2),
 ('handsome', 2),
 ('curly', 2),
 ('fancy', 2),
 ('American', 2),
 ('momentous', 2),
 ('playful', 2),
 ('honest', 2),
 ('confused', 2),
 ('ungrateful', 2),
 ('determined', 2),
 ('valuable', 2),
 ('excitable', 2),
 ('secure', 2),
 ('happiest', 2),
 ('sweeter', 2),
 ('grand', 2),
 ('stean', 2),
 ('stubble', 2),
 ('fed', 2),
 ('kitten', 2),
 ('raw', 2),
 ('exceptional', 2),
 ('easier', 2),
 ('giant', 2),
 ('aud', 2),
 ('mild', 2),
 ('downward', 2),
 ('lively', 2),
 ('incredible', 2),
 ('flat', 2),
 ('chief', 2),
 ('impatient', 2),
 ('superstitious', 2),
 ('First', 2),
 ('gone', 2),
 ('furious', 2),
 ('paler', 2),
 ('larger', 2),
 ('routine', 2),
 ('servile', 2),
 ('sublime', 2),
 ('shifty', 2),
 ('bulky', 2),
 ('unhurt', 2),
 ('exhausted', 2),
 ('fat', 2),
 ('functional', 2),
 ('malady', 2),
 ('stiff', 2),
 ('seeming', 2),
 ('frenzied', 2),
 ('recuperative', 2),
 ('beneficial', 2),
 ('alarmed', 2),
 ('beneficent', 2),
 ('stalwart', 2),
 ('answer:--', 2),
 ('cerebral', 2),
 ('Quick', 2),
 ('undone', 2),
 ('pallid', 2),
 ('crimson', 2),
 ('medicinal', 2),
 ('healthy', 2),
 ('poignant', 2),
 ('hospitable', 2),
 ('ead', 2),
 ('ome', 2),
 ('unnecessary', 2),
 ('pronounced', 2),
 ('peaceful', 2),
 ('surgical', 2),
 ('acrid', 2),
 ('sternest', 2),
 ('outstretched', 2),
 ('profound', 2),
 ('so', 2),
 ('unattended', 2),
 ('Unopened', 2),
 ('genial', 2),
 ('spirited', 2),
 ('lips:--', 2),
 ('sterner', 2),
 ('professional', 2),
 ('specific', 2),
 ('mortem', 2),
 ('shocked', 2),
 ('eternal', 2),
 ('direct', 2),
 ('hostile', 2),
 ('youthful', 2),
 ('moral', 2),
 ('logical', 2),
 ('forceful', 2),
 ('harder', 2),
 ('following', 2),
 ('silly', 2),
 ('indicative', 2),
 ('typewritten', 2),
 ('barren', 2),
 ('overwrought', 2),
 ('corporeal', 2),
 ('comparative', 2),
 ('numerous', 2),
 ('northern', 2),
 ('fewer', 2),
 ('horrid', 2),
 ('unhallowed', 2),
 ('Most', 2),
 ('Un', 2),
 ('rational', 2),
 ('puzzled', 2),
 ('frank', 2),
 ('set', 2),
 ('affected', 2),
 ('obedient', 2),
 ('careless', 2),
 ('callous', 2),
 ('livid', 2),
 ('horrified', 2),
 ('carnal', 2),
 ('devilish', 2),
 ('blessed', 2),
 ('Brave', 2),
 ('hideous', 2),
 ('awkward', 2),
 ('chronological', 2),
 ('thoughtful', 2),
 ('ultimate', 2),
 ('central', 2),
 ('neutral', 2),
 ('emotional', 2),
 ('overwhelmed', 2),
 ('appealing', 2),
 ('contemptuous', 2),
 ('apt', 2),
 ('elemental', 2),
 ('positive', 2),
 ('live', 2),
 ('meaner', 2),
 ('idiotic', 2),
 ('relieved', 2),
 ('conventional', 2),
 ('liable', 2),
 ('longer', 2),
 ('respectable', 2),
 ('amenable', 2),
 ('odour', 2),
 ('corrupt', 2),
 ('sound', 2),
 ('false', 2),
 ('indifferent', 2),
 ('unspeakable', 2),
 ('typical', 2),
 ('laconic', 2),
 ('weakest', 2),
 ('paralysed', 2),
 ('pitiful', 2),
 ('simplest', 2),
 ('forgetful', 2),
 ('holiest', 2),
 ('devoted', 2),
 ('Hush', 2),
 ('greenish', 2),
 ('flagged', 2),
 ('radiant', 2),
 ('latest', 2),
 ('sole', 2),
 ('doubtful', 2),
 ('alert', 2),
 ('predestinate', 2),
 ('criminal', 2),
 ('commercial', 2),
 ('Many', 2),
 ('FULL', 2),
 ('online', 2),
 ('readable', 2),
 ('widest', 2),
 ('exempt', 2),
 ('federal', 2),
 ('*', 1),
 ('national', 1),
 ('wildest', 1),
 ('distinct', 1),
 ('eleventh', 1),
 ('imaginative', 1),
 ('mamaliga', 1),
 ('impletata', 1),
 ('unpunctual', 1),
 ('outside', 1),
 ('barbarian', 1),
 ('baggy', 1),
 ('enormous', 1),
 ('Oriental', 1),
 ('separate', 1),
 ('seventeenth', 1),
 ('ungracious', 1),
 ('disagreeable', 1),
 ('sloping', 1),
 ('gable', 1),
 ('bewildering', 1),
 ('rugged', 1),
 ('feverish', 1),
 ('serpentine', 1),
 ('prevalent', 1),
 ('peasant', 1),
 ('crazy', 1),
 ('rolling', 1),
 ('thunderous', 1),
 ('own:--', 1),
 ('universal', 1),
 ('rear', 1),
 ('manageable', 1),
 ('optical', 1),
 ('beetling', 1),
 ('shaggy', 1),
 ('impalpable', 1),
 ('interminable', 1),
 ('customary', 1),
 ('antique', 1),
 ('akin', 1),
 ('door:--', 1),
 ('hasty', 1),
 ('graceful', 1),
 ('discreet', 1),
 ('domed', 1),
 ('astonishing', 1),
 ('coarse', 1),
 ('squat', 1),
 ('rank', 1),
 ('protuberant', 1),
 ('costliest', 1),
 ('fabulous', 1),
 ('Blue', 1),
 ('crowded', 1),
 ('flattering', 1),
 ('smallest', 1),
 ('unchecked', 1),
 ('artificial', 1),
 ('triumphant', 1),
 ('undiscovered', 1),
 ('sided', 1),
 ('cardinal', 1),
 ('gloomy', 1),
 ('mediæval', 1),
 ('straggling', 1),
 ('habitable', 1),
 ('sparkling', 1),
 ('attuned', 1),
 ('saturnine', 1),
 ('conceivable', 1),
 ('preternatural', 1),
 ('remiss', 1),
 ('wretched', 1),
 ('bauble', 1),
 ('annoying', 1),
 ('magnificent', 1),
 ('menial', 1),
 ('tangible', 1),
 ('race:--', 1),
 ('European', 1),
 ('Ugric', 1),
 ('victorious', 1),
 ('unworthy', 1),
 ('dishonourable', 1),
 ('meagre', 1),
 ('thinnest', 1),
 ('inaccessible', 1),
 ('nocturnal', 1),
 ('impregnable', 1),
 ('bygone', 1),
 ('curtainless', 1),
 ('unchanged', 1),
 ('wavy', 1),
 ('musical', 1),
 ('deliberate', 1),
 ('thrilling', 1),
 ('scarlet', 1),
 ('Lower', 1),
 ('lurid', 1),
 ('soulless', 1),
 ('aghast', 1),
 ('unquestionable', 1),
 ('unwound', 1),
 ('sanctuary', 1),
 ('madness', 1),
 ('fearless', 1),
 ('spoken', 1),
 ('smoothest', 1),
 ('letters:--', 1),
 ('cheery', 1),
 ('surest', 1),
 ('sturdy', 1),
 ('studded', 1),
 ('unloaded', 1),
 ('spade', 1),
 ('nebulous', 1),
 ('Quicker', 1),
 ('quicker', 1),
 ('dishevelled', 1),
 ('metallic', 1),
 ('vaporous', 1),
 ('Austrian', 1),
 ('Greek', 1),
 ('circular', 1),
 ('sickly', 1),
 ('heavier', 1),
 ('stony', 1),
 ('genuine', 1),
 ('real:--', 1),
 ('Close', 1),
 ('ponderous', 1),
 ('louder', 1),
 ('angrier', 1),
 ('fuller', 1),
 ('filthy', 1),
 ('clanging', 1),
 ('fast', 1),
 ('assistant', 1),
 ('stenographic', 1),
 ('hurried', 1),
 ('imperturbable', 1),
 ('tough', 1),
 ('psychological', 1),
 ('exquisite', 1),
 ('manly', 1),
 ('sloppy', 1),
 ('rarer', 1),
 ('appetite', 1),
 ('sanguine', 1),
 ('centripetal', 1),
 ('centrifugal', 1),
 ('paramount', 1),
 ('noblest', 1),
 ('romantic', 1),
 ('nicest', 1),
 ('mournful', 1),
 ('gnarled', 1),
 ('brusquely:--', 1),
 ('eatin', 1),
 ('cheap', 1),
 ('dictatorial', 1),
 ('illsome', 1),
 ('ireful', 1),
 ('acant', 1),
 ('slippy', 1),
 ('poorish', 1),
 ('aftest', 1),
 ('bier', 1),
 ('Lively', 1),
 ('dearly', 1),
 ('pious', 1),
 ('tombstean', 1),
 ('paved', 1),
 ('back', 1),
 ('wholesome', 1),
 ('Whole', 1),
 ('rudimentary', 1),
 ('sleek', 1),
 ('unprepared', 1),
 ('tame', 1),
 ('undeveloped', 1),
 ('opiate', 1),
 ('cumulative', 1),
 ('hopeless', 1),

Then we make a dataframe from this list:

Pandas Review

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]
adj count
0 good 198
1 old 188
2 other 185
3 own 184
4 more 178
5 great 173
6 poor 171
7 little 164
8 dear 151
9 much 148
10 such 129
11 last 115
12 same 110
13 white 103
14 many 100
15 terrible 99
16 full 97
17 long 90
18 few 86
19 strange 85
20 first 78
21 new 73
22 ready 71
23 dead 69
24 red 67
25 whole 66
26 open 66
27 sweet 65
28 dark 60
29 strong 59
30 very 57
31 true 54
32 heavy 53
33 young 53
34 quick 48
35 able 47
36 happy 47
37 right 47
38 asleep 47
39 big 44
40 small 43
41 sure 43
42 better 43
43 best 41
44 cold 41
45 wild 41
46 close 41
47 free 41
48 late 40
49 certain 40
50 present 40
51 afraid 39
52 high 38
53 quiet 37
54 pale 36
55 silent 35
56 glad 35
57 usual 33
58 sad 33
59 possible 32
60 bad 32
61 least 31
62 beautiful 31
63 low 31
64 awful 31
65 thin 31
66 hard 30
67 brave 30
68 alone 29
69 mad 29
70 next 28
71 deep 28
72 anxious 28
73 wonderful 27
74 empty 27
75 electronic 27
76 black 26
77 sharp 26
78 half 26
79 awake 25
80 sudden 25
81 horrible 25
82 necessary 25
83 fair 25
84 safe 25
85 Good 25
86 grim 24
87 bright 24
88 fresh 24
89 tired 24
90 wide 23
91 different 23
92 only 23
93 common 22
94 satisfied 22
95 noble 21
96 short 21
97 enough 21
98 dreadful 21
99 bitter 21
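Notice that the tally treats “good” and “Good” as two different adjectives, because Counter compares the exact text of each token. The sketch below shows one possible way to merge such case variants, using a handful of counts copied from the adjective tally above. The names `adjs_sample` and `merge_case()` are hypothetical and introduced only for this illustration.

```python
from collections import Counter

# A few (adjective, count) pairs copied from the adjective tally above
adjs_sample = Counter({"good": 198, "Good": 25, "poor": 171,
                       "Poor": 20, "dead": 69, "Dead": 18})

def merge_case(tally):
    """Combine counts for words that differ only in capitalization."""
    merged = Counter()
    for word, count in tally.items():
        merged[word.lower()] += count
    return merged

merged_tally = merge_case(adjs_sample)
print(merged_tally.most_common())
```

When working with spaCy tokens directly, appending token.text.lower() instead of token.text should have a similar effect, and token.lemma_ goes a step further by reducing each word to its base form. Whether merging is appropriate depends on the question you are asking of the text.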

Get Nouns#






POS     Description    Examples
NOUN    noun           girl, cat, tree, air, beauty

To extract and count nouns, we can follow the same model as above, except we will change our if statement to check for POS labels that match “NOUN”.

nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]
noun count
0 time 385
1 night 314
2 man 251
3 room 231
4 way 223
5 day 218
6 hand 202
7 door 199
8 face 197
9 eyes 188
10 things 171
11 friend 166
12 work 162
13 life 144
14 heart 140
15 men 138
16 place 133
17 house 133
18 window 116
19 sleep 112
20 blood 112
21 one 111
22 moment 106
23 head 104
24 hands 104
25 morning 98
26 thing 91
27 _ 89
28 bed 89
29 death 88
30 mind 87
31 others 82
32 sort 81
33 child 74
34 fear 72
35 case 72
36 husband 72
37 rest 71
38 side 68
39 light 68
40 word 66
41 soul 65
42 world 62
43 part 61
44 days 61
45 box 61
46 ship 61
47 dear 60
48 water 59
49 end 59
50 lips 59
51 woman 57
52 look 57
53 hour 56
54 diary 56
55 horses 56
56 brain 55
57 body 55
58 sun 54
59 air 54
60 times 54
61 voice 52
62 fellow 51
63 words 50
64 earth 50
65 boxes 50
66 trouble 49
67 thought 49
68 mother 48
69 people 47
70 morrow 47
71 silence 47
72 letter 46
73 strength 46
74 cause 46
75 feet 46
76 power 46
77 kind 45
78 home 45
79 women 45
80 wolves 45
81 sunset 44
82 sea 43
83 key 43
84 o'clock 43
85 throat 43
86 patient 43
87 snow 42
88 teeth 42
89 knowledge 42
90 instant 41
91 friends 41
92 matter 41
93 duty 40
94 fire 40
95 none 40
96 coffin 40
97 sight 39
98 minutes 39
99 wind 39
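Because the adjective and noun examples follow the same pattern, it is possible to collect tallies for every part of speech in a single pass. The following is a minimal sketch under stated assumptions: `tally_by_pos()` is a hypothetical helper, and `sample_tokens` is a made-up list of (text, POS) pairs standing in for spaCy tokens.

```python
from collections import Counter, defaultdict

def tally_by_pos(tagged_tokens):
    """Count words separately for each POS label in a single pass."""
    tallies = defaultdict(Counter)
    for text, pos in tagged_tokens:
        tallies[pos][text] += 1
    return tallies

# Made-up (text, POS) pairs standing in for spaCy tokens
sample_tokens = [("night", "NOUN"), ("dark", "ADJ"), ("night", "NOUN"),
                 ("door", "NOUN"), ("dark", "ADJ"), ("old", "ADJ")]

tallies = tally_by_pos(sample_tokens)
print(tallies["NOUN"].most_common())
print(tallies["ADJ"].most_common())
```

With a processed spaCy document, the same helper could be called with a generator such as ((token.text, token.pos_) for token in document).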

Get Verbs#






POS     Description    Examples
VERB    verb           run, runs, running, eat, ate, eating

To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we’re going to make our code even more economical and efficient (while still changing our if statement to match the POS label “VERB”).

Python Review

We can use a list comprehension to get our list of verbs in a single line of code! Closely examine the first line of code below:

verbs = [token.text for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]
verb count
0 said 461
1 know 396
2 see 377
3 have 348
4 came 303
5 went 298
6 come 295
7 do 278
8 had 277
9 go 269
10 seemed 240
11 took 223
12 saw 216
13 think 216
14 made 196
15 looked 186
16 was 183
17 tell 175
18 get 168
19 make 163
20 got 156
21 found 154
22 is 153
23 told 144
24 say 141
25 asked 139
26 take 136
27 knew 130
28 done 128
29 find 114
30 let 113
31 want 112
32 began 109
33 put 106
34 thought 105
35 hear 101
36 coming 98
37 seen 95
38 look 94
39 keep 94
40 heard 91
41 looking 89
42 felt 86
43 turned 84
44 left 83
45 stood 80
46 opened 80
47 read 79
48 help 79
49 give 78
50 sleep 78
51 feel 77
52 held 73
53 seems 72
54 are 72
55 lay 72
56 gone 70
57 sat 69
58 ask 68
59 gave 67
60 going 65
61 believe 65
62 seem 64
63 spoke 64
64 try 64
65 has 64
66 set 63
67 fear 63
68 speak 62
69 tried 62
70 write 61
71 did 60
72 fell 57
73 kept 56
74 understand 55
75 passed 55
76 leave 55
77 be 55
78 suppose 53
79 ran 50
80 answered 50
81 grew 49
82 like 48
83 love 48
84 taken 47
85 used 47
86 were 46
87 lost 45
88 called 44
89 die 44
90 says 44
91 stopped 43
92 wanted 43
93 moved 43
94 wish 43
95 wait 42
96 mean 42
97 meet 42
98 given 42
99 laid 42
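To see that the list comprehension used for the verbs does the same job as the for loop and if statement used for adjectives and nouns, here is a small side-by-side sketch. The list `sample_tokens` is made up for illustration and stands in for spaCy tokens; each item is a (text, POS) pair.

```python
# Made-up (text, POS) pairs standing in for spaCy tokens
sample_tokens = [("She", "PRON"), ("opened", "VERB"), ("the", "DET"),
                 ("door", "NOUN"), ("and", "CCONJ"), ("ran", "VERB")]

# Version 1: a for loop with an if statement and .append()
verbs_loop = []
for text, pos in sample_tokens:
    if pos == "VERB":
        verbs_loop.append(text)

# Version 2: the same logic written as a list comprehension
verbs_comprehension = [text for text, pos in sample_tokens if pos == "VERB"]

print(verbs_loop)
print(verbs_comprehension)
```

Both versions produce the same list; the comprehension simply packs the loop, the condition, and the append into one line.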

Keyword Extraction#

Get Sentences with Keyword#

spaCy can also identify sentences in a document. To access sentences, we can iterate through document.sents and pull out the .text of each sentence.

We can use spaCy’s sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below.

With the function find_sentences_with_keyword(), we will iterate through document.sents and pull out any sentence that contains a particular “keyword.” Then we will display these sentences with the keywords bolded.

import re
from IPython.display import Markdown, display

def find_sentences_with_keyword(keyword, document):
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(re.escape(keyword), f"**{keyword}**", sentence, flags=re.IGNORECASE)
            #Display the sentence as Markdown so that the keyword appears in bold
            display(Markdown(sentence))

find_sentences_with_keyword(keyword="telegram", document=document)

“ _telegram from Arthur Holmwood to Quincey P. Morris.

“ _telegram, Arthur Holmwood to Seward.

You must send to me the telegram every day; and if there be cause I shall come again.

telegram, Seward, London, to Van Helsing, Amsterdam._

telegram, Seward, London, to Van Helsing, Amsterdam.

telegram, Seward, London, to Van Helsing, Amsterdam.6 September.–Terrible change for the worse.

I hold over telegram to Holmwood till have seen you.

“I waited till I had seen you, as I said in my telegram.

A telegram came from Van Helsing at Amsterdam whilst I was at dinner, suggesting that I should be at Hillingham to-night, as it might be well to be at hand, and stating that he was leaving by the night mail and would join me early in the morning.

telegram, Van Helsing, Antwerp, to Seward, Carfax._ (Sent to Carfax, Sussex, as no county given; delivered late by twenty-two hours.)

The arrival of Van Helsing’s telegram filled me with dismay.

Did you not get my telegram?” I answered as quickly and coherently as I could that I had only got his telegram early in the morning, and had not lost a minute in coming here, and that I could not make any one in the house hear me.

“ He handed me a telegram:– “Have not heard from Seward for three days, and am terribly anxious. Cannot leave.

“ In the hall I met Quincey Morris, with a telegram for Arthur telling him that Mrs. Westenra was dead; that Lucy also had been ill, but was now going on better; and that Van Helsing and I were with her.

Later.–A sad home-coming in every way–the house empty of the dear soul who was so good to us; Jonathan still pale and dizzy under a slight relapse of his malady; and now a telegram from Van Helsing, whoever he may be:– “You will be grieved to hear that Mrs. Westenra died five days ago, and that Lucy died the day before yesterday.

“ _telegram, Mrs. Harker to Van Helsing.

When we arrived at the Berkeley Hotel, Van Helsing found a telegram waiting for him:– “Am coming up by train.

I have sent a telegram to Jonathan to come on here when he arrives in London from Whitby.

“ About half an hour after we had received Mrs. Harker’s telegram, there came a quiet, resolute knock at the hall door.

Nota bene, in Madam’s telegram he went south from Carfax, that means he went to cross the river, and he could only do so at slack of tide, which should be something before one o’clock.

Lord Godalming went to the Consulate to see if any telegram had arrived for him, whilst the rest of us came on to this hotel–“the Odessus.”

He had four telegrams, one each day since we started, and all to the same effect: that the Czarina Catherine had not been reported to Lloyd’s from anywhere.

He had arranged before leaving London that his agent should send him every day a telegram saying if the ship had been reported.

Daily telegrams to Godalming, but only the same story: “Not yet reported.”

telegram, October 24th.

We were all wild with excitement yesterday when Godalming got his telegram from Lloyd’s.

The telegrams from London have been the same: “no further report.”

28 October.–telegram.

Dr. Seward’s Diary. 28 October.–When the telegram came announcing the arrival in Galatz I do not think it was such a shock to any of us as might have been expected.
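You may have noticed two quirks in the output above. The function replaces every match with the keyword exactly as we typed it, so sentences that begin with “Telegram” come back in lowercase. It also matches the keyword inside longer words, such as “telegrams.” Here is a minimal sketch of one way to handle both cases. The helper bold_keyword() and its whole_word parameter are hypothetical names used only for illustration; they are not part of spaCy or of the function above.

```python
import re

def bold_keyword(sentence, keyword, whole_word=False):
    """Wrap each match of keyword in ** **, keeping the match's original capitalization."""
    # re.escape guards against keywords that contain regex characters such as "." or "?"
    pattern = re.escape(keyword)
    if whole_word:
        # \b marks a word boundary, so "telegram" no longer matches inside "telegrams"
        pattern = rf"\b{pattern}\b"
    # \g<0> re-inserts the text that actually matched, capitalization included
    return re.sub(pattern, r"**\g<0>**", sentence, flags=re.IGNORECASE)

print(bold_keyword("Telegram, Seward, London, to Van Helsing, Amsterdam.", "telegram"))
print(bold_keyword("He had four telegrams, one each day.", "telegram"))
print(bold_keyword("He had four telegrams, one each day.", "telegram", whole_word=True))
```

Whether you want whole-word matching depends on your question: for Dracula, catching “telegrams” along with “telegram” is probably a feature rather than a bug.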

Get Keyword in Context#

We can also find out about a keyword’s more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what are called ngrams. An “ngram” is any sequence of n consecutive tokens in a text. Ngrams are an important concept in computational linguistics and NLP. (Have you ever played with Google’s Ngram Viewer?)
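To make the idea concrete, here is a small sketch that pulls bigrams (n=2) and trigrams (n=3) out of a short, made-up list of tokens. The function name get_ngrams() and the sample tokens are illustrative assumptions, not something taken from spaCy.

```python
def get_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Build n copies of the list, each shifted one position further to the right,
    # then zip them together; zip stops when the shortest copy runs out
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["I", "have", "sent", "a", "telegram"]
print(get_ngrams(tokens, 2))
print(get_ngrams(tokens, 3))
```

A list of 5 tokens yields 4 bigrams and 3 trigrams: in general, there are len(tokens) - n + 1 ngrams.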

Below we’re going to make a list of bigrams, that is, all the pairs of consecutive words from Dracula. We’re going to use these bigrams to find the neighboring words that appear alongside particular keywords.

#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
    return ngrams
bigrams = get_bigrams(tokens_and_labels)

Let’s take a peek at the bigrams:

[[('by', 'ADP'), ('Bram', 'PROPN')],
 [('Bram', 'PROPN'), ('Stoker', 'PROPN')],
 [('Stoker', 'PROPN'), ('This', 'DET')],
 [('This', 'DET'), ('eBook', 'PROPN')],
 [('eBook', 'PROPN'), ('is', 'AUX')],
 [('is', 'AUX'), ('for', 'ADP')],
 [('for', 'ADP'), ('the', 'DET')],
 [('the', 'DET'), ('use', 'NOUN')],
 [('use', 'NOUN'), ('of', 'ADP')],
 [('of', 'ADP'), ('anyone', 'PRON')],
 [('anyone', 'PRON'), ('anywhere', 'ADV')],
 [('anywhere', 'ADV'), ('at', 'ADP')],
 [('at', 'ADP'), ('no', 'DET')],
 [('no', 'DET'), ('cost', 'NOUN')],
 [('cost', 'NOUN'), ('and', 'CCONJ')]]

Now that we have our list of bigrams, we’re going to make a function get_neighbor_words(). This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the pos_label parameter.

from collections import Counter
def get_neighbor_words(keyword, bigrams, pos_label = None):
    neighbor_words = []
    keyword = keyword.lower()
    for bigram in bigrams:
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        #Check to see if keyword is in the bigram
        if keyword in words:
            for word, label in bigram:
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    #Count the neighbor words and return them from most to least frequent
    return Counter(neighbor_words).most_common()
get_neighbor_words("telegram", bigrams)
[('a', 6),
 ('from', 3),
 ('seward', 3),
 ('arthur', 2),
 ('the', 2),
 ('to', 2),
 ('my', 2),
 ('i', 2),
 ('came', 2),
 ('helsing', 2),
 ('his', 2),
 ('harker', 2),
 ('morris', 1),
 ('every', 1),
 ('see', 1),
 ('day', 1),
 ('back', 1),
 ('over', 1),
 ('it', 1),
 ('van', 1),
 ('filled', 1),
 ('early', 1),
 ('for', 1),
 ('waiting', 1),
 ('there', 1),
 ('madam', 1),
 ('he', 1),
 ('any', 1),
 ('had', 1),
 ('saying', 1),
 ('masts', 1),
 ('october', 1)]
get_neighbor_words("telegram", bigrams, pos_label='VERB')
[('came', 2), ('see', 1), ('filled', 1), ('waiting', 1), ('saying', 1)]
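Because each keyword sits in two bigrams (one with the word before it and one with the word after it), the counts above mix left-hand and right-hand neighbors together. If you want to tell them apart, one possible approach is sketched below. The function get_left_right_neighbors() and the sample list of (word, POS label) pairs are hypothetical and made up for illustration; with real data you would pass in something like tokens_and_labels.

```python
from collections import Counter

def get_left_right_neighbors(keyword, tagged_tokens, pos_label=None):
    """Count the words directly before and directly after each occurrence of keyword."""
    keyword = keyword.lower()
    left, right = Counter(), Counter()
    for index, (word, label) in enumerate(tagged_tokens):
        if word.lower() != keyword:
            continue
        # Look one token to the left, unless the keyword is the very first token
        if index > 0:
            prev_word, prev_label = tagged_tokens[index - 1]
            if pos_label is None or prev_label == pos_label:
                left[prev_word.lower()] += 1
        # Look one token to the right, unless the keyword is the very last token
        if index < len(tagged_tokens) - 1:
            next_word, next_label = tagged_tokens[index + 1]
            if pos_label is None or next_label == pos_label:
                right[next_word.lower()] += 1
    return left, right

# Made-up sample in the same (word, POS label) shape as tokens_and_labels
sample = [("A", "DET"), ("telegram", "NOUN"), ("came", "VERB"), ("from", "ADP"),
          ("Van", "PROPN"), ("Helsing", "PROPN"), ("I", "PRON"), ("sent", "VERB"),
          ("a", "DET"), ("telegram", "NOUN"), ("to", "ADP"), ("Jonathan", "PROPN")]

left, right = get_left_right_neighbors("telegram", sample)
print(left.most_common())
print(right.most_common())
```

Splitting the counts this way can show, for example, whether a verb tends to come before a keyword or after it.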

Your Turn!#

Try out find_sentences_with_keyword() and get_neighbor_words() with your own keywords of interest.

find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)