Part-of-Speech Tagging#
In this lesson, we’re going to learn about the textual analysis methods part-of-speech tagging and keyword extraction. These methods will help us computationally parse sentences and better understand words in context.
We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for other languages.
[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”
-Claire Cain Miller, “Ada Lovelace,” NYT Overlooked Obituaries
Why is Part-of-Speech Tagging Useful?#
I don’t mean to go all Language Nerd on you, but parts of speech are important. Even if they seem kind of boring. Parts of speech are the grammatical units of language — such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.
By computationally identifying parts of speech, we can start computationally exploring syntax, the relationship between words, rather than only focusing on words in isolation, as we did with tf-idf. Though parts of speech may seem pedantic, they help computers (and us) take a crack at that ever-elusive abstract noun: meaning.
spaCy and Natural Language Processing (NLP)#
To computationally identify parts of speech, we’re going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.
To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more.
Install spaCy#
To use spaCy, we first need to install the library.
!pip install -U spacy
Import Libraries#
Then we’re going to import spacy and displacy, a special spaCy module for visualization.
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 400
pd.options.display.max_colwidth = 400
We’re also going to import Counter from the collections module for counting nouns, verbs, adjectives, etc., and the pandas library for organizing and displaying data. We’re also raising the pandas display limits for the maximum number of rows and the maximum column width.
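To make the role of Counter concrete, here is a minimal sketch that tallies a short list of part-of-speech labels. The list named labels is made-up illustration data, not output from the spaCy model.

```python
from collections import Counter

# Hypothetical POS labels for a seven-token sentence (illustration only,
# not produced by spaCy).
labels = ["PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "NOUN"]

# Counter maps each distinct item to the number of times it appears.
label_tally = Counter(labels)

# most_common(n) returns the n highest counts as (item, count) pairs.
top_label = label_tally.most_common(1)
print(top_label)             # [('NOUN', 2)]
print(label_tally["ADJ"])    # 1
print(label_tally["INTJ"])   # 0, because missing items count as zero
```

The same call works on any list of strings, which is how it is used on lists of words later in the lesson.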
Download Language Model#
Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:
!python -m spacy download en_core_web_sm
Requirement already satisfied: en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0 in /Users/melaniewalsh/anaconda3/lib/python3.7/site-packages (2.1.0)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.
spaCy offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.
Load Language Model#
Once the model is downloaded, we need to load it with spacy.load() and assign it to the variable nlp.
nlp = spacy.load('en_core_web_sm')
Create a Processed spaCy Document#
Whenever we use spaCy, our first step will be to create a processed spaCy document with the loaded NLP model nlp(). Most of the heavy NLP lifting is done in this line of code. After processing, the document object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.
To test out spaCy’s part-of-speech tagging, we’ll begin by processing a sample sentence from Ada Lovelace’s obituary:
[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”
This sentence makes for an interesting example because it is syntactically complex and because it contains tricky, ambiguous words such as “spell,” “abstract,” and “force.”
sample = """She “has thrown her magical spell around the most abstract of Sciences."""
document = nlp(sample)
spaCy Part-of-Speech Tagging#
| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where, there |
| AUX | auxiliary | is, has (done), will (do), should (do) |
| CONJ | conjunction | and, or, but |
| CCONJ | coordinating conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| PART | particle | ’s, not |
| PRON | pronoun | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | ., (, ), ? |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| VERB | verb | run, runs, running, eat, ate, eating |
| X | other | sfpksdpsxmsa |
| SPACE | space | |
Above is a POS chart taken from spaCy’s website, which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy’s POS tagging in action, we can use the spaCy module displacy on our sample document with the style= parameter set to “dep” (short for dependency parsing):
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}
displacy.render(document, style="dep", options=options)
As you can see, spaCy has correctly identified that “spell” is a noun in our sample sentence:
for token in document:
    if token.pos_ == "NOUN":
        print(token, token.pos_)
spell NOUN
But if we look at the same words in a different context — in a sentence that I made up — spaCy can identify when these words have changed grammatical roles and meanings.
You shouldn’t force someone to learn how to spell Babbage. They just need practice. You can’t abstract it.
document = nlp("You shouldn't force someone to learn how to spell Babbage. They just need practice. You can't abstract it.")
for token in document:
    if token.pos_ == "VERB":
        print(token, token.pos_)
force VERB
learn VERB
spell VERB
need VERB
abstract VERB
Where previously spaCy had identified “spell” as a noun, here spaCy correctly identifies the words “force,” “spell,” and “abstract” as verbs.
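The two loops above differ only in the label they test for, so the pattern can be wrapped in a small function. The sketch below is one way to do that. The function name words_with_pos and the StandInToken tuple are hypothetical, introduced here so the example runs without loading a model, and the labels in it are assigned by hand rather than predicted by spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: it carries only the two
# attributes used below. Labels are hand-assigned, not model output.
StandInToken = namedtuple("StandInToken", ["text", "pos_"])

def words_with_pos(tokens, pos_label):
    """Return the text of every token whose coarse POS label matches."""
    return [token.text for token in tokens if token.pos_ == pos_label]

tokens = [
    StandInToken("You", "PRON"),
    StandInToken("can", "AUX"),
    StandInToken("spell", "VERB"),
    StandInToken("a", "DET"),
    StandInToken("spell", "NOUN"),
]

print(words_with_pos(tokens, "VERB"))  # ['spell']
print(words_with_pos(tokens, "NOUN"))  # ['spell']
print(words_with_pos(tokens, "ADJ"))   # []
```

Because spaCy tokens expose the same .text and .pos_ attributes, a processed document could be passed in place of the stand-in list, for example words_with_pos(document, "VERB").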
Practicing with Dracula#
filepath = "../texts/literature/Dracula_Bram-Stoker.txt"
with open(filepath, encoding="utf-8") as file:
    text = file.read()
document = nlp(text)
Get Adjectives#
| POS | Description | Examples |
|---|---|---|
| ADJ | adjective | big, old, green, incomprehensible, first |
To extract and count the adjectives in Dracula, we will follow the same model as above, except we’ll add an if statement that will pull out words only if their POS label matches “ADJ.”
Python Review
While we demonstrate how to extract parts of speech in the sections below, we’re also going to reinforce some integral Python skills. Notice how we use for loops and if statements to .append() specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
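One decision this append-then-count pattern leaves open is capitalization: Counter treats “Good” and “good” as two different words. Here is a minimal sketch of the same pattern with lowercasing added. It again uses hand-made stand-in tokens (hypothetical, not model output) so that it runs on its own.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in tokens with hand-assigned labels (illustration only).
StandInToken = namedtuple("StandInToken", ["text", "pos_"])
tokens = [
    StandInToken("Good", "ADJ"),
    StandInToken("night", "NOUN"),
    StandInToken("good", "ADJ"),
    StandInToken("old", "ADJ"),
    StandInToken("friend", "NOUN"),
]

# The for/if/append pattern, with .lower() added so that capitalized and
# lowercase forms of a word are tallied together.
adjs_lower = []
for token in tokens:
    if token.pos_ == "ADJ":
        adjs_lower.append(token.text.lower())

tally = Counter(adjs_lower)
print(tally.most_common())  # [('good', 2), ('old', 1)]
```

Whether to lowercase is a judgment call: it merges sentence-initial words with their lowercase forms, but it also lowercases adjectives such as “English” or “German.”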
Here we make a list of the adjectives identified in Dracula:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)
adjs
['*',
'available',
'next',
'wonderful',
'little',
'correct',
'possible',
'western',
'splendid',
'noble',
'Turkish',
'good',
'red',
'good',
'thirsty',
'national',
'able',
'German',
'useful',
'able',
'extreme',
'Carpathian',
'wildest',
'least',
'able',
'exact',
'own',
'distinct',
'latter',
'eleventh',
'known',
'imaginative',
'interesting',
'comfortable',
'thirsty',
'continuous',
'more',
'mamaliga',
'excellent',
'impletata',
'little',
'more',
'further',
'unpunctual',
'full',
'little',
'steep',
'such',
'old',
'wide',
'subject',
'great',
'strong',
'outside',
'clear',
'short',
'round',
'picturesque',
'pretty',
'clumsy',
'full',
'white',
'other',
'most',
'big',
'strangest',
'barbarian',
'big',
'great',
'baggy',
'dirty',
'white',
'white',
'enormous',
'heavy',
'high',
'long',
'black',
'heavy',
'black',
'picturesque',
'old',
'Oriental',
'harmless',
'natural',
'dark',
'interesting',
'old',
'stormy',
'great',
'terrible',
'separate',
'very',
'seventeenth',
'proper',
'great',
'old',
'fashioned',
'elderly',
'usual',
'white',
'long',
'double',
'coloured',
'tight',
'elderly',
'white',
'happy',
'beautiful',
'best',
'reticent',
'true',
'least',
'old',
'other',
'frightened',
'mysterious',
'old',
'hysterical',
'young',
'excited',
'German',
'other',
'able',
'many',
'important',
'fourth',
'evil',
'full',
'such',
'evident',
'least',
'ridiculous',
'comfortable',
'imperative',
'such',
'idolatrous',
'ungracious',
'old',
'late',
'old',
'many',
'ghostly',
'easy',
'usual',
'high',
'distant',
'jagged',
'big',
'little',
'mixed',
'sleepy',
'awake',
'many',
'odd',
'red',
'simple',
'disagreeable',
'most',
'queer',
'many',
'same',
'other',
'considerable',
'evil',
'pleasant',
'unknown',
'unknown',
'hearted',
'sorrowful',
'sympathetic',
'last',
'picturesque',
'wide',
'rich',
'green',
'wide',
'whole',
'big',
'small',
'ghostly',
'able',
'green',
'sloping',
'full',
'steep',
'blank',
'gable',
'bewildering',
'green',
'green',
'pine',
'rugged',
'feverish',
'bent',
'excellent',
'different',
'general',
'old',
'good',
'old',
'foreign',
'green',
'mighty',
'lofty',
'full',
'glorious',
'beautiful',
'deep',
'purple',
'green',
'brown',
'endless',
'jagged',
'snowy',
'mighty',
'white',
'lofty',
'serpentine',
'right',
'endless',
'lower',
'snowy',
'delicate',
'cool',
'picturesque',
'prevalent',
'many',
'peasant',
'outer',
'many',
'new',
'beautiful',
'white',
'delicate',
'ordinary',
'long',
'like',
'sure',
'white',
'coloured',
'latter',
'long',
'cold',
'dark',
'dark',
'pine',
'great',
'weird',
'solemn',
'grim',
'strange',
'like',
'steep',
'fierce',
'grim',
'enough',
'such',
'only',
'dark',
'other',
'further',
'long',
'wild',
'further',
'greater',
'crazy',
'great',
'stormy',
'more',
'several',
'odd',
'varied',
'simple',
'good',
'kindly',
'strange',
'evil',
'evident',
'exciting',
'slightest',
'little',
'last',
'eastern',
'dark',
'rolling',
'heavy',
'oppressive',
'thunderous',
'dark',
'only',
'own',
'white',
'sandy',
'white',
'own',
'low',
'less',
'German',
'worse',
'own:--',
'next',
'better',
'next',
'universal',
'black',
'splendid',
'tall',
'long',
'brown',
'great',
'black',
'bright',
'red',
'early',
'stranger',
'swift',
'red',
'sharp',
'white',
'dead',
'strange',
'same',
'close',
'prodigious',
'late',
'strange',
'lonely',
'excellent',
'same',
'little',
'little',
'frightened',
'unknown',
'hard',
'complete',
'straight',
'same',
'curious',
'few',
'general',
'recent',
'sick',
'long',
'agonised',
'wild',
'first',
'rear',
'sudden',
'fright',
'sharper',
'same',
'minded',
'great',
'few',
'own',
'accustomed',
'quiet',
'able',
'extraordinary',
'manageable',
'great',
'far',
'narrow',
'great',
'colder',
'colder',
'fine',
'white',
'keen',
'nearer',
'afraid',
'right',
'faint',
'blue',
'same',
'less',
'asleep',
'awful',
'blue',
'faint',
'few',
'strange',
'optical',
'same',
'momentary',
'blue',
'worse',
'black',
'jagged',
'beetling',
'white',
'red',
'long',
'shaggy',
'more',
'terrible',
'grim',
'such',
'true',
'peculiar',
'painful',
'only',
'imperious',
'long',
'impalpable',
'heavy',
'strange',
'uncanny',
'dreadful',
'afraid',
'interminable',
'complete',
'occasional',
'quick',
'main',
'conscious',
'vast',
'tall',
'black',
'broken',
'jagged',
'moonlit',
'asleep',
'awake',
'remarkable',
'considerable',
'several',
'dark',
'great',
'round',
'bigger',
'able',
'prodigious',
'great',
'old',
'large',
'massive',
'dim',
'dark',
'dark',
'likely',
'endless',
'grim',
'customary',
'successful',
'awake',
'horrible',
'awake',
'awake',
'patient',
'heavy',
'great',
'massive',
'loud',
'long',
'great',
'tall',
'old',
'clean',
'shaven',
'long',
'white',
'single',
'antique',
'long',
'open',
'old',
'right',
'courtly',
'excellent',
'strange',
'Welcome',
'own',
'cold',
'dead',
'Welcome',
'akin',
'same',
'sure',
'courtly',
'late',
'available',
'great',
'great',
'open',
'heavy',
'mighty',
'great',
'small',
'octagonal',
'single',
'welcome',
'great',
'top',
'fresh',
'hollow',
'wide',
'door:--',
'ready',
'other',
'prepared',
'courteous',
'normal',
'hasty',
'other',
'great',
'graceful',
'charming',
'least',
'constant',
'happy',
'sufficient',
'possible',
'young',
'full',
'own',
'faithful',
'discreet',
'silent',
'ready',
'excellent',
'old',
'many',
'same',
'strong',
'strong',
'high',
'thin',
'arched',
'lofty',
'domed',
'massive',
'bushy',
'own',
'heavy',
'cruel',
'looking',
'sharp',
'white',
'remarkable',
'astonishing',
'pale',
'pointed',
'broad',
'strong',
'general',
'extraordinary',
'white',
'fine',
'close',
'coarse',
'broad',
'squat',
'Strange',
'long',
'fine',
'sharp',
'rank',
'horrible',
'grim',
'more',
'protuberant',
'own',
'silent',
'first',
'dim',
'strange',
'many',
'strange',
'tired',
'ready',
'courteous',
'octagonal',
'strange',
'own',
'dear',
'early',
'last',
'own',
'cold',
'hot',
'absent',
'hearty',
'odd',
'extraordinary',
'round',
'immense',
'costliest',
'beautiful',
'fabulous',
'old',
'excellent',
'little',
'opposite',
'great',
'vast',
'English',
'whole',
'full',
'English',
'recent',
'varied',
'political',
'English',
'such',
'Blue',
'hearty',
'good',
'glad',
'sure',
'much',
'good',
'past',
'many',
'many',
'great',
'crowded',
'mighty',
'whirl',
'flattering',
'little',
'True',
'enough',
'noble',
'common',
'strange',
'content',
'long',
'least',
'other',
'alone',
'new',
'English',
'smallest',
'sorry',
'many',
'important',
'willing',
'sure',
'many',
'strange',
'strange',
'much',
'evident',
'many',
'strange',
'blue',
'certain',
'last',
'evil',
'unchecked',
'blue',
'last',
'little',
'old',
'aged',
'artificial',
'triumphant',
'little',
'friendly',
'undiscovered',
'sure',
'long',
'sharp',
'dear',
'own',
'able',
'right',
'more',
'dead',
'other',
'last',
'own',
'next',
'interested',
'myriad',
'more',
'needful',
'alone',
'other',
'necessary',
'ready',
'suitable',
'high',
'ancient',
'heavy',
'large',
'closed',
'heavy',
'old',
'old',
'sided',
'cardinal',
'solid',
'many',
'gloomy',
'deep',
'dark',
'small',
'clear',
'fair',
'sized',
'large',
'mediæval',
'thick',
'few',
'high',
'close',
'old',
'various',
'straggling',
'great',
'few',
'close',
'large',
'private',
'lunatic',
'visible',
'glad',
'old',
'big',
'old',
'new',
'habitable',
'few',
'old',
'common',
'dead',
'bright',
'much',
'sparkling',
'young',
'gay',
'young',
'weary',
'dead',
'attuned',
'many',
'cold',
'broken',
'alone',
'malignant',
'saturnine',
'little',
'certain',
'little',
'new',
'other',
'better',
'ready',
'next',
'excellent',
'ready',
'previous',
'last',
'conceivable',
'late',
'sleepy',
'long',
'tired',
'preternatural',
'clear',
'remiss',
'long',
'dear',
'new',
'interesting',
'own',
'little',
'warm',
'glad',
'first',
'strange',
'uneasy',
'safe',
'strange',
'only',
'prosaic',
'few',
'Good',
'whole',
'mistaken',
'close',
'whole',
'startling',
'many',
'strange',
'vague',
'near',
'little',
'half',
'instant',
'dangerous',
'wretched',
'bauble',
'heavy',
'terrible',
'annoying',
'strange',
'peculiar',
'little',
'magnificent',
'very',
'terrible',
'green',
'deep',
'silver',
'deep',
'locked',
'castle',
'available',
'veritable',
'wild',
'little',
'other',
'few',
'mad',
'helpless',
'best',
'definite',
'certain',
'own',
'only',
'open',
'own',
'desperate',
'latter',
'great',
'own',
'odd',
'menial',
'fright',
'terrible',
'terrible',
'wild',
'good',
'good',
'odd',
'idolatrous',
'tangible',
'careful',
'long',
'few',
'present',
'own',
'fascinating',
'whole',
'excited',
'great',
'white',
'main',
'race:--',
'proud',
'many',
'brave',
'European',
'Ugric',
'such',
'fell',
'warlike',
'old',
'great',
'proud',
'strange',
'Hungarian',
'Hungarian',
'victorious',
'more',
'endless',
'sleepless',
'bloody',
'warlike',
'great',
'own',
'own',
'own',
'unworthy',
'other',
'later',
'great',
'bloody',
'good',
'Hungarian',
'free',
'young',
'warlike',
'precious',
'dishonourable',
'great',
'bare',
'meagre',
'own',
'Last',
'legal',
'certain',
'certain',
'useful',
'more',
'wise',
'more',
'certain',
'practical',
'local',
'beautiful',
'good',
'Good',
'strange',
'local',
'much',
'more',
'easy',
'other',
'local',
'further',
'such',
'Good',
'best',
'wonderful',
'much',
'wonderful',
'available',
'first',
'other',
'young',
'heavy',
'other',
'cold',
'own',
'smooth',
'good',
'young',
'other',
'thinnest',
'foreign',
'quiet',
'sharp',
'red',
'careful',
'able',
'formal',
'quiet',
'several',
'own',
'third',
'fourth',
'second',
'fourth',
'about',
'much',
'private',
'dear',
'young',
'other',
'old',
'many',
'bad',
'own',
'safe',
'careful',
'gruesome',
'only',
'terrible',
'unnatural',
'horrible',
'last',
'freer',
'little',
'vast',
'inaccessible',
'narrow',
'fresh',
'nocturnal',
'own',
'full',
'horrible',
'terrible',
'beautiful',
'soft',
'yellow',
'light',
'soft',
'distant',
'mere',
'own',
'tall',
'deep',
'complete',
'many',
'many',
'interested',
'wonderful',
'small',
'very',
'whole',
'castle',
'dreadful',
'great',
'weird',
'clear',
...]
Then we count how many times each adjective appears in this list with Counter():
adjs_tally = Counter(adjs)
adjs_tally.most_common()
[('good', 198),
('old', 188),
('other', 185),
('own', 184),
('more', 178),
('great', 173),
('poor', 171),
('little', 164),
('dear', 151),
('much', 148),
('such', 129),
('last', 115),
('same', 110),
('white', 103),
('many', 100),
('terrible', 99),
('full', 97),
('long', 90),
('few', 86),
('strange', 85),
('first', 78),
('new', 73),
('ready', 71),
('dead', 69),
('red', 67),
('whole', 66),
('open', 66),
('sweet', 65),
('dark', 60),
('strong', 59),
('very', 57),
('true', 54),
('heavy', 53),
('young', 53),
('quick', 48),
('able', 47),
('happy', 47),
('right', 47),
('asleep', 47),
('big', 44),
('small', 43),
('sure', 43),
('better', 43),
('best', 41),
('cold', 41),
('wild', 41),
('close', 41),
('free', 41),
('late', 40),
('certain', 40),
('present', 40),
('afraid', 39),
('high', 38),
('quiet', 37),
('pale', 36),
('silent', 35),
('glad', 35),
('usual', 33),
('sad', 33),
('possible', 32),
('bad', 32),
('least', 31),
('beautiful', 31),
('low', 31),
('awful', 31),
('thin', 31),
('hard', 30),
('brave', 30),
('alone', 29),
('mad', 29),
('next', 28),
('deep', 28),
('anxious', 28),
('wonderful', 27),
('empty', 27),
('electronic', 27),
('black', 26),
('sharp', 26),
('half', 26),
('awake', 25),
('sudden', 25),
('horrible', 25),
('necessary', 25),
('fair', 25),
('safe', 25),
('Good', 25),
('grim', 24),
('bright', 24),
('fresh', 24),
('tired', 24),
('wide', 23),
('different', 23),
('only', 23),
('common', 22),
('satisfied', 22),
('noble', 21),
('short', 21),
('enough', 21),
('dreadful', 21),
('bitter', 21),
('weak', 21),
('odd', 20),
('Poor', 20),
('well', 19),
('round', 18),
('most', 18),
('evident', 18),
('worse', 18),
('ill', 18),
('real', 18),
('Dead', 18),
('simple', 17),
('less', 17),
('fine', 17),
('careful', 17),
('earnest', 17),
('wrong', 17),
('evil', 16),
('several', 16),
('past', 16),
('clever', 16),
('hypnotic', 16),
('latter', 15),
('sleepy', 15),
('fierce', 15),
('greater', 15),
('complete', 15),
('second', 15),
('angry', 15),
('cheerful', 15),
('blind', 15),
('excellent', 14),
('general', 14),
('tall', 14),
('Last', 14),
('soft', 14),
('sacred', 14),
('worth', 14),
('nice', 14),
('easy', 13),
('green', 13),
('blue', 13),
('large', 13),
('various', 13),
('foul', 13),
('calm', 13),
('fearful', 13),
('dearest', 13),
('surprised', 13),
('alive', 13),
('further', 12),
('clear', 12),
('important', 12),
('considerable', 12),
('early', 12),
('broken', 12),
('thick', 12),
('impossible', 12),
('sane', 12),
('nervous', 12),
('unhappy', 12),
('kind', 12),
('personal', 12),
('stern', 12),
('grateful', 12),
('useful', 11),
('unknown', 11),
('lower', 11),
('solemn', 11),
('faint', 11),
('conscious', 11),
('sufficient', 11),
('hot', 11),
('dangerous', 11),
('precious', 11),
('lovely', 11),
('human', 11),
('-', 11),
('violent', 11),
('grave', 11),
('selfish', 11),
('special', 11),
('public', 11),
('rough', 11),
('available', 10),
('interesting', 10),
('endless', 10),
('sick', 10),
('painful', 10),
('main', 10),
('broad', 10),
('willing', 10),
('uneasy', 10),
('near', 10),
('wise', 10),
('mere', 10),
('voluptuous', 10),
('deadly', 10),
('serious', 10),
('physical', 10),
('grey', 10),
('stronger', 10),
('tiny', 10),
('particular', 10),
('exact', 9),
('comfortable', 9),
('steep', 9),
('natural', 9),
('double', 9),
('mysterious', 9),
('excited', 9),
('distant', 9),
('foreign', 9),
('mighty', 9),
('lonely', 9),
('curious', 9),
('extraordinary', 9),
('narrow', 9),
('dim', 9),
('sorry', 9),
('previous', 9),
('desperate', 9),
('light', 9),
('unconscious', 9),
('worst', 9),
('difficult', 9),
('active', 9),
('tight', 8),
('ordinary', 8),
('like', 8),
('peculiar', 8),
('loud', 8),
('faithful', 8),
('warm', 8),
('vague', 8),
('helpless', 8),
('local', 8),
('former', 8),
('due', 8),
('similar', 8),
('resolute', 8),
('secret', 8),
('miserable', 8),
('funny', 8),
('absolute', 8),
('busy', 8),
('strait', 8),
('mortal', 8),
('spiritual', 8),
('eager', 8),
('left', 8),
('rare', 8),
('equal', 8),
('bent', 7),
('straight', 7),
('sharper', 7),
('far', 7),
('successful', 7),
('patient', 7),
('single', 7),
('courteous', 7),
('interested', 7),
('lunatic', 7),
('proud', 7),
('later', 7),
('about', 7),
('gentle', 7),
('More', 7),
('churchyard', 7),
('beloved', 7),
('regular', 7),
('accurate', 7),
('armed', 7),
('weaker', 7),
('rusty', 7),
('utmost', 7),
('mental', 7),
('infinite', 7),
('dry', 7),
('stertorous', 7),
('holy', 7),
('correct', 6),
('pretty', 6),
('frightened', 6),
('pleasant', 6),
('recent', 6),
('agonised', 6),
('fright', 6),
('hollow', 6),
('hearty', 6),
('English', 6),
('sized', 6),
('private', 6),
('visible', 6),
('weary', 6),
('bare', 6),
('legal', 6),
('yellow', 6),
('dusty', 6),
('upset', 6),
('fatal', 6),
('wooden', 6),
('diabolical', 6),
('intense', 6),
('favourite', 6),
('startled', 6),
('intellectual', 6),
('suspicious', 6),
('ghastly', 6),
('immediate', 6),
('picturesque', 5),
('proper', 5),
('hysterical', 5),
('ghostly', 5),
('jagged', 5),
('hearted', 5),
('blank', 5),
('cool', 5),
('swift', 5),
('accustomed', 5),
('vast', 5),
('massive', 5),
('likely', 5),
('clean', 5),
('prepared', 5),
('bushy', 5),
('cruel', 5),
('immense', 5),
('True', 5),
('startling', 5),
('instant', 5),
('silver', 5),
('castle', 5),
('bloody', 5),
('third', 5),
('brilliant', 5),
('golden', 5),
('cunning', 5),
('sensitive', 5),
('Same', 5),
('dizzy', 5),
('vain', 5),
('subtle', 5),
('despairing', 5),
('earthly', 5),
('homicidal', 5),
('vital', 5),
('concerned', 5),
('unusual', 5),
('restless', 5),
('additional', 5),
('loose', 5),
('humble', 5),
('younger', 5),
('pure', 5),
('apparent', 5),
('hush', 5),
('garlic', 5),
('pleased', 5),
('slow', 5),
('original', 5),
('individual', 5),
('thicker', 5),
('known', 4),
('dirty', 4),
('harmless', 4),
('fourth', 4),
('sorrowful', 4),
('snowy', 4),
('delicate', 4),
('weird', 4),
('keen', 4),
('nearer', 4),
('courtly', 4),
('normal', 4),
('charming', 4),
('constant', 4),
('pointed', 4),
('whirl', 4),
('content', 4),
('ancient', 4),
('gay', 4),
('mistaken', 4),
('locked', 4),
('definite', 4),
('warlike', 4),
('smooth', 4),
('sheer', 4),
('south', 4),
('wicked', 4),
('touching', 4),
('manifest', 4),
('rocky', 4),
('safer', 4),
('laden', 4),
('harsh', 4),
('lethal', 4),
('useless', 4),
('doubtless', 4),
('idle', 4),
('nearest', 4),
('married', 4),
('haired', 4),
('ashamed', 4),
('perfect', 4),
('Little', 4),
('unselfish', 4),
('sceptical', 4),
('Sacred', 4),
('ye', 4),
('firm', 4),
('zoöphagous', 4),
('anæmic', 4),
('sore', 4),
('greatest', 4),
('entire', 4),
('needless', 4),
('Russian', 4),
('steady', 4),
('ignorant', 4),
('agonising', 4),
('rosy', 4),
('strict', 4),
('deserted', 4),
('padded', 4),
('troubled', 4),
('unexpected', 4),
('reasonable', 4),
('slight', 4),
('lethargic', 4),
('desolate', 4),
('deeper', 4),
('bewildered', 4),
('ole', 4),
('tender', 4),
('intent', 4),
('feeble', 4),
('whiter', 4),
('unable', 4),
('amazed', 4),
('harrowing', 4),
('closer', 4),
('rude', 4),
('leaden', 4),
('unclean', 4),
('hungry', 4),
('official', 4),
('brute', 4),
('highest', 4),
('worried', 4),
('hellish', 4),
('stable', 4),
('Turkish', 3),
('thirsty', 3),
('German', 3),
('elderly', 3),
('rich', 3),
('lofty', 3),
('glorious', 3),
('purple', 3),
('brown', 3),
('outer', 3),
('eastern', 3),
('oppressive', 3),
('sandy', 3),
('colder', 3),
('uncanny', 3),
('occasional', 3),
('remarkable', 3),
('Strange', 3),
('needful', 3),
('suitable', 3),
('solid', 3),
('Hungarian', 3),
('sleepless', 3),
('practical', 3),
('formal', 3),
('gruesome', 3),
('freer', 3),
('unlocked', 3),
('nineteenth', 3),
('Great', 3),
('dreamy', 3),
('delightful', 3),
('repulsive', 3),
('shadowy', 3),
('intact', 3),
('uncertain', 3),
('key', 3),
('square', 3),
('post', 3),
('earthy', 3),
('powerful', 3),
('happier', 3),
('redder', 3),
('bloated', 3),
('quickest', 3),
('inclined', 3),
('worthy', 3),
('balanced', 3),
('drunk', 3),
('loving', 3),
('fit', 3),
('nigh', 3),
('wakeful', 3),
('devouring', 3),
('manifold', 3),
('wet', 3),
('dank', 3),
('inner', 3),
('foolish', 3),
('eyed', 3),
('fond', 3),
('haggard', 3),
('languid', 3),
('respectful', 3),
('religious', 3),
('jealous', 3),
('flit', 3),
('medical', 3),
('advanced', 3),
('truest', 3),
('appalling', 3),
('dull', 3),
('sullen', 3),
('gravely:--', 3),
('thankful', 3),
('narcotic', 3),
('probable', 3),
('nauseous', 3),
('sovereign', 3),
('bold', 3),
('stupid', 3),
('frightful', 3),
('frantic', 3),
('frequent', 3),
('decent', 3),
('aware', 3),
('monstrous', 3),
('dreary', 3),
('bowed', 3),
('negative', 3),
('daily', 3),
('unholy', 3),
('smaller', 3),
('actual', 3),
('limited', 3),
('electric', 3),
('huge', 3),
('future', 3),
('powerless', 3),
('gallant', 3),
('final', 3),
('higher', 3),
('Unclean', 3),
('orderly', 3),
('plain', 3),
('wily', 3),
('derivative', 3),
('applicable', 3),
('defective', 3),
('western', 2),
('splendid', 2),
('extreme', 2),
('Carpathian', 2),
('continuous', 2),
('subject', 2),
('clumsy', 2),
('strangest', 2),
('stormy', 2),
('fashioned', 2),
('coloured', 2),
('reticent', 2),
('ridiculous', 2),
('imperative', 2),
('idolatrous', 2),
('mixed', 2),
('queer', 2),
('sympathetic', 2),
('pine', 2),
('varied', 2),
('kindly', 2),
('exciting', 2),
('slightest', 2),
('stranger', 2),
('prodigious', 2),
('minded', 2),
('momentary', 2),
('imperious', 2),
('moonlit', 2),
('bigger', 2),
('shaven', 2),
('Welcome', 2),
('octagonal', 2),
('welcome', 2),
('top', 2),
('arched', 2),
('looking', 2),
('absent', 2),
('opposite', 2),
('political', 2),
('aged', 2),
('friendly', 2),
('myriad', 2),
('closed', 2),
('malignant', 2),
('prosaic', 2),
('veritable', 2),
('fascinating', 2),
('fell', 2),
('unnatural', 2),
('sidelong', 2),
('thorough', 2),
('hateful', 2),
('merciful', 2),
('intolerable', 2),
('super', 2),
('languorous', 2),
('slender', 2),
('vile', 2),
('villainy', 2),
('piteous', 2),
('ruthless', 2),
('tiniest', 2),
('aërial', 2),
('phantom', 2),
('extravagant', 2),
('naked', 2),
('British', 2),
('stately', 2),
('out:--', 2),
('swollen', 2),
('merry', 2),
('strained', 2),
('cursed', 2),
('handsome', 2),
('curly', 2),
('fancy', 2),
('American', 2),
('momentous', 2),
('playful', 2),
('honest', 2),
('confused', 2),
('ungrateful', 2),
('determined', 2),
('valuable', 2),
('excitable', 2),
('secure', 2),
('happiest', 2),
('sweeter', 2),
('grand', 2),
('stean', 2),
('stubble', 2),
('fed', 2),
('kitten', 2),
('raw', 2),
('exceptional', 2),
('easier', 2),
('giant', 2),
('aud', 2),
('mild', 2),
('downward', 2),
('lively', 2),
('incredible', 2),
('flat', 2),
('chief', 2),
('impatient', 2),
('superstitious', 2),
('First', 2),
('gone', 2),
('furious', 2),
('paler', 2),
('larger', 2),
('routine', 2),
('servile', 2),
('sublime', 2),
('shifty', 2),
('bulky', 2),
('unhurt', 2),
('exhausted', 2),
('fat', 2),
('functional', 2),
('malady', 2),
('stiff', 2),
('seeming', 2),
('frenzied', 2),
('recuperative', 2),
('beneficial', 2),
('alarmed', 2),
('beneficent', 2),
('stalwart', 2),
('answer:--', 2),
('cerebral', 2),
('Quick', 2),
('undone', 2),
('pallid', 2),
('crimson', 2),
('medicinal', 2),
('healthy', 2),
('poignant', 2),
('hospitable', 2),
('ead', 2),
('ome', 2),
('unnecessary', 2),
('pronounced', 2),
('peaceful', 2),
('surgical', 2),
('acrid', 2),
('sternest', 2),
('outstretched', 2),
('profound', 2),
('so', 2),
('unattended', 2),
('Unopened', 2),
('genial', 2),
('spirited', 2),
('lips:--', 2),
('sterner', 2),
('professional', 2),
('specific', 2),
('mortem', 2),
('shocked', 2),
('eternal', 2),
('direct', 2),
('hostile', 2),
('youthful', 2),
('moral', 2),
('logical', 2),
('forceful', 2),
('harder', 2),
('following', 2),
('silly', 2),
('indicative', 2),
('typewritten', 2),
('barren', 2),
('overwrought', 2),
('corporeal', 2),
('comparative', 2),
('numerous', 2),
('northern', 2),
('fewer', 2),
('horrid', 2),
('unhallowed', 2),
('Most', 2),
('Un', 2),
('rational', 2),
('puzzled', 2),
('frank', 2),
('set', 2),
('affected', 2),
('obedient', 2),
('careless', 2),
('callous', 2),
('livid', 2),
('horrified', 2),
('carnal', 2),
('devilish', 2),
('blessed', 2),
('Brave', 2),
('hideous', 2),
('awkward', 2),
('chronological', 2),
('thoughtful', 2),
('ultimate', 2),
('central', 2),
('neutral', 2),
('emotional', 2),
('overwhelmed', 2),
('appealing', 2),
('contemptuous', 2),
('apt', 2),
('elemental', 2),
('positive', 2),
('live', 2),
('meaner', 2),
('idiotic', 2),
('relieved', 2),
('conventional', 2),
('liable', 2),
('longer', 2),
('respectable', 2),
('amenable', 2),
('odour', 2),
('corrupt', 2),
('sound', 2),
('false', 2),
('indifferent', 2),
('unspeakable', 2),
('typical', 2),
('laconic', 2),
('weakest', 2),
('paralysed', 2),
('pitiful', 2),
('simplest', 2),
('forgetful', 2),
('holiest', 2),
('devoted', 2),
('Hush', 2),
('greenish', 2),
('flagged', 2),
('radiant', 2),
('latest', 2),
('sole', 2),
('doubtful', 2),
('alert', 2),
('predestinate', 2),
('criminal', 2),
('commercial', 2),
('Many', 2),
('FULL', 2),
('online', 2),
('readable', 2),
('widest', 2),
('exempt', 2),
('federal', 2),
('*', 1),
('national', 1),
('wildest', 1),
('distinct', 1),
('eleventh', 1),
('imaginative', 1),
('mamaliga', 1),
('impletata', 1),
('unpunctual', 1),
('outside', 1),
('barbarian', 1),
('baggy', 1),
('enormous', 1),
('Oriental', 1),
('separate', 1),
('seventeenth', 1),
('ungracious', 1),
('disagreeable', 1),
('sloping', 1),
('gable', 1),
('bewildering', 1),
('rugged', 1),
('feverish', 1),
('serpentine', 1),
('prevalent', 1),
('peasant', 1),
('crazy', 1),
('rolling', 1),
('thunderous', 1),
('own:--', 1),
('universal', 1),
('rear', 1),
('manageable', 1),
('optical', 1),
('beetling', 1),
('shaggy', 1),
('impalpable', 1),
('interminable', 1),
('customary', 1),
('antique', 1),
('akin', 1),
('door:--', 1),
('hasty', 1),
('graceful', 1),
('discreet', 1),
('domed', 1),
('astonishing', 1),
('coarse', 1),
('squat', 1),
('rank', 1),
('protuberant', 1),
('costliest', 1),
('fabulous', 1),
('Blue', 1),
('crowded', 1),
('flattering', 1),
('smallest', 1),
('unchecked', 1),
('artificial', 1),
('triumphant', 1),
('undiscovered', 1),
('sided', 1),
('cardinal', 1),
('gloomy', 1),
('mediæval', 1),
('straggling', 1),
('habitable', 1),
('sparkling', 1),
('attuned', 1),
('saturnine', 1),
('conceivable', 1),
('preternatural', 1),
('remiss', 1),
('wretched', 1),
('bauble', 1),
('annoying', 1),
('magnificent', 1),
('menial', 1),
('tangible', 1),
('race:--', 1),
('European', 1),
('Ugric', 1),
('victorious', 1),
('unworthy', 1),
('dishonourable', 1),
('meagre', 1),
('thinnest', 1),
('inaccessible', 1),
('nocturnal', 1),
('impregnable', 1),
('bygone', 1),
('curtainless', 1),
('unchanged', 1),
('wavy', 1),
('musical', 1),
('deliberate', 1),
('thrilling', 1),
('scarlet', 1),
('Lower', 1),
('lurid', 1),
('soulless', 1),
('aghast', 1),
('unquestionable', 1),
('unwound', 1),
('sanctuary', 1),
('madness', 1),
('fearless', 1),
('spoken', 1),
('smoothest', 1),
('letters:--', 1),
('cheery', 1),
('surest', 1),
('sturdy', 1),
('studded', 1),
('unloaded', 1),
('spade', 1),
('nebulous', 1),
('Quicker', 1),
('quicker', 1),
('dishevelled', 1),
('metallic', 1),
('vaporous', 1),
('Austrian', 1),
('Greek', 1),
('circular', 1),
('sickly', 1),
('heavier', 1),
('stony', 1),
('genuine', 1),
('real:--', 1),
('Close', 1),
('ponderous', 1),
('louder', 1),
('angrier', 1),
('fuller', 1),
('filthy', 1),
('clanging', 1),
('fast', 1),
('assistant', 1),
('stenographic', 1),
('hurried', 1),
('imperturbable', 1),
('tough', 1),
('psychological', 1),
('exquisite', 1),
('manly', 1),
('sloppy', 1),
('rarer', 1),
('appetite', 1),
('sanguine', 1),
('centripetal', 1),
('centrifugal', 1),
('paramount', 1),
('noblest', 1),
('romantic', 1),
('nicest', 1),
('mournful', 1),
('gnarled', 1),
('brusquely:--', 1),
('eatin', 1),
('cheap', 1),
('dictatorial', 1),
('illsome', 1),
('ireful', 1),
('acant', 1),
('slippy', 1),
('poorish', 1),
('aftest', 1),
('bier', 1),
('Lively', 1),
('dearly', 1),
('pious', 1),
('tombstean', 1),
('paved', 1),
('back', 1),
('wholesome', 1),
('Whole', 1),
('rudimentary', 1),
('sleek', 1),
('unprepared', 1),
('tame', 1),
('undeveloped', 1),
('opiate', 1),
('cumulative', 1),
('hopeless', 1),
...]
Then we make a dataframe from this list:
Pandas Review
Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]
 | adj | count |
---|---|---|
0 | good | 198 |
1 | old | 188 |
2 | other | 185 |
3 | own | 184 |
4 | more | 178 |
5 | great | 173 |
6 | poor | 171 |
7 | little | 164 |
8 | dear | 151 |
9 | much | 148 |
10 | such | 129 |
11 | last | 115 |
12 | same | 110 |
13 | white | 103 |
14 | many | 100 |
15 | terrible | 99 |
16 | full | 97 |
17 | long | 90 |
18 | few | 86 |
19 | strange | 85 |
20 | first | 78 |
21 | new | 73 |
22 | ready | 71 |
23 | dead | 69 |
24 | red | 67 |
25 | whole | 66 |
26 | open | 66 |
27 | sweet | 65 |
28 | dark | 60 |
29 | strong | 59 |
30 | very | 57 |
31 | true | 54 |
32 | heavy | 53 |
33 | young | 53 |
34 | quick | 48 |
35 | able | 47 |
36 | happy | 47 |
37 | right | 47 |
38 | asleep | 47 |
39 | big | 44 |
40 | small | 43 |
41 | sure | 43 |
42 | better | 43 |
43 | best | 41 |
44 | cold | 41 |
45 | wild | 41 |
46 | close | 41 |
47 | free | 41 |
48 | late | 40 |
49 | certain | 40 |
50 | present | 40 |
51 | afraid | 39 |
52 | high | 38 |
53 | quiet | 37 |
54 | pale | 36 |
55 | silent | 35 |
56 | glad | 35 |
57 | usual | 33 |
58 | sad | 33 |
59 | possible | 32 |
60 | bad | 32 |
61 | least | 31 |
62 | beautiful | 31 |
63 | low | 31 |
64 | awful | 31 |
65 | thin | 31 |
66 | hard | 30 |
67 | brave | 30 |
68 | alone | 29 |
69 | mad | 29 |
70 | next | 28 |
71 | deep | 28 |
72 | anxious | 28 |
73 | wonderful | 27 |
74 | empty | 27 |
75 | electronic | 27 |
76 | black | 26 |
77 | sharp | 26 |
78 | half | 26 |
79 | awake | 25 |
80 | sudden | 25 |
81 | horrible | 25 |
82 | necessary | 25 |
83 | fair | 25 |
84 | safe | 25 |
85 | Good | 25 |
86 | grim | 24 |
87 | bright | 24 |
88 | fresh | 24 |
89 | tired | 24 |
90 | wide | 23 |
91 | different | 23 |
92 | only | 23 |
93 | common | 22 |
94 | satisfied | 22 |
95 | noble | 21 |
96 | short | 21 |
97 | enough | 21 |
98 | dreadful | 21 |
99 | bitter | 21 |
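Notice that the tally above lists “good” (198) and “Good” (25) as separate adjectives, because Counter compares strings exactly. If that distinction is not wanted, one option is to lowercase each word before counting. Here is a minimal sketch of the idea; sample_adjs is a small hand-made list standing in for the adjectives pulled from the document, not real output from Dracula.

```python
from collections import Counter

# Hypothetical stand-in for the list of adjectives extracted from the spaCy document
sample_adjs = ["good", "Good", "old", "good", "Brave", "brave"]

# Counting the raw strings keeps capitalized and lowercase forms apart
raw_tally = Counter(sample_adjs)

# Lowercasing each word first merges them into a single entry
lowercase_tally = Counter(adj.lower() for adj in sample_adjs)

print(raw_tally.most_common())
print(lowercase_tally.most_common())
```

Whether to merge the two forms is a judgment call: capitalization can carry information, such as a word that opens a sentence.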
Get Nouns#
POS | Description | Examples |
---|---|---|
NOUN | noun | girl, cat, tree, air, beauty |
To extract and count nouns, we can follow the same model as above, except we will change our if statement to check for POS labels that match “NOUN”.
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]
 | noun | count |
---|---|---|
0 | time | 385 |
1 | night | 314 |
2 | man | 251 |
3 | room | 231 |
4 | way | 223 |
5 | day | 218 |
6 | hand | 202 |
7 | door | 199 |
8 | face | 197 |
9 | eyes | 188 |
10 | things | 171 |
11 | friend | 166 |
12 | work | 162 |
13 | life | 144 |
14 | heart | 140 |
15 | men | 138 |
16 | place | 133 |
17 | house | 133 |
18 | window | 116 |
19 | sleep | 112 |
20 | blood | 112 |
21 | one | 111 |
22 | moment | 106 |
23 | head | 104 |
24 | hands | 104 |
25 | morning | 98 |
26 | thing | 91 |
27 | _ | 89 |
28 | bed | 89 |
29 | death | 88 |
30 | mind | 87 |
31 | others | 82 |
32 | sort | 81 |
33 | child | 74 |
34 | fear | 72 |
35 | case | 72 |
36 | husband | 72 |
37 | rest | 71 |
38 | side | 68 |
39 | light | 68 |
40 | word | 66 |
41 | soul | 65 |
42 | world | 62 |
43 | part | 61 |
44 | days | 61 |
45 | box | 61 |
46 | ship | 61 |
47 | dear | 60 |
48 | water | 59 |
49 | end | 59 |
50 | lips | 59 |
51 | woman | 57 |
52 | look | 57 |
53 | hour | 56 |
54 | diary | 56 |
55 | horses | 56 |
56 | brain | 55 |
57 | body | 55 |
58 | sun | 54 |
59 | air | 54 |
60 | times | 54 |
61 | voice | 52 |
62 | fellow | 51 |
63 | words | 50 |
64 | earth | 50 |
65 | boxes | 50 |
66 | trouble | 49 |
67 | thought | 49 |
68 | mother | 48 |
69 | people | 47 |
70 | morrow | 47 |
71 | silence | 47 |
72 | letter | 46 |
73 | strength | 46 |
74 | cause | 46 |
75 | feet | 46 |
76 | power | 46 |
77 | kind | 45 |
78 | home | 45 |
79 | women | 45 |
80 | wolves | 45 |
81 | sunset | 44 |
82 | sea | 43 |
83 | key | 43 |
84 | o'clock | 43 |
85 | throat | 43 |
86 | patient | 43 |
87 | snow | 42 |
88 | teeth | 42 |
89 | knowledge | 42 |
90 | instant | 41 |
91 | friends | 41 |
92 | matter | 41 |
93 | duty | 40 |
94 | fire | 40 |
95 | none | 40 |
96 | coffin | 40 |
97 | sight | 39 |
98 | minutes | 39 |
99 | wind | 39 |
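The adjective and noun cells differ only in the POS label they check, so the pattern could be wrapped in a small helper. The sketch below is one way to do that. The name tally_by_pos is hypothetical, and the SimpleNamespace objects only imitate the .text and .pos_ attributes of spaCy tokens so that the example runs without a language model.

```python
from collections import Counter
from types import SimpleNamespace

def tally_by_pos(tokens, pos_label):
    """Count the text of every token whose .pos_ attribute matches pos_label."""
    return Counter(token.text for token in tokens if token.pos_ == pos_label)

# Made-up stand-ins for spaCy tokens (assumption: real tokens expose .text and .pos_)
fake_tokens = [
    SimpleNamespace(text=text, pos_=pos)
    for text, pos in [
        ("night", "NOUN"),
        ("came", "VERB"),
        ("night", "NOUN"),
        ("dark", "ADJ"),
        ("door", "NOUN"),
    ]
]

noun_tally = tally_by_pos(fake_tokens, "NOUN")
print(noun_tally.most_common())
```

With the lesson's spaCy document, the equivalent call would be tally_by_pos(document, 'NOUN').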
Get Verbs#
POS | Description | Examples |
---|---|---|
VERB | verb | run, runs, running, eat, ate, eating |
To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we’re going to make our code even more economical and efficient (while still changing our if statement to match the POS label “VERB”).
Python Review
We can use a list comprehension to get our list of verbs in a single line of code! Closely examine the first line of code below:
verbs = [token.text for token in document if token.pos_ == 'VERB']
verbs_tally = Counter(verbs)
df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]
 | verb | count |
---|---|---|
0 | said | 461 |
1 | know | 396 |
2 | see | 377 |
3 | have | 348 |
4 | came | 303 |
5 | went | 298 |
6 | come | 295 |
7 | do | 278 |
8 | had | 277 |
9 | go | 269 |
10 | seemed | 240 |
11 | took | 223 |
12 | saw | 216 |
13 | think | 216 |
14 | made | 196 |
15 | looked | 186 |
16 | was | 183 |
17 | tell | 175 |
18 | get | 168 |
19 | make | 163 |
20 | got | 156 |
21 | found | 154 |
22 | is | 153 |
23 | told | 144 |
24 | say | 141 |
25 | asked | 139 |
26 | take | 136 |
27 | knew | 130 |
28 | done | 128 |
29 | find | 114 |
30 | let | 113 |
31 | want | 112 |
32 | began | 109 |
33 | put | 106 |
34 | thought | 105 |
35 | hear | 101 |
36 | coming | 98 |
37 | seen | 95 |
38 | look | 94 |
39 | keep | 94 |
40 | heard | 91 |
41 | looking | 89 |
42 | felt | 86 |
43 | turned | 84 |
44 | left | 83 |
45 | stood | 80 |
46 | opened | 80 |
47 | read | 79 |
48 | help | 79 |
49 | give | 78 |
50 | sleep | 78 |
51 | feel | 77 |
52 | held | 73 |
53 | seems | 72 |
54 | are | 72 |
55 | lay | 72 |
56 | gone | 70 |
57 | sat | 69 |
58 | ask | 68 |
59 | gave | 67 |
60 | going | 65 |
61 | believe | 65 |
62 | seem | 64 |
63 | spoke | 64 |
64 | try | 64 |
65 | has | 64 |
66 | set | 63 |
67 | fear | 63 |
68 | speak | 62 |
69 | tried | 62 |
70 | write | 61 |
71 | did | 60 |
72 | fell | 57 |
73 | kept | 56 |
74 | understand | 55 |
75 | passed | 55 |
76 | leave | 55 |
77 | be | 55 |
78 | suppose | 53 |
79 | ran | 50 |
80 | answered | 50 |
81 | grew | 49 |
82 | like | 48 |
83 | love | 48 |
84 | taken | 47 |
85 | used | 47 |
86 | were | 46 |
87 | lost | 45 |
88 | called | 44 |
89 | die | 44 |
90 | says | 44 |
91 | stopped | 43 |
92 | wanted | 43 |
93 | moved | 43 |
94 | wish | 43 |
95 | wait | 42 |
96 | mean | 42 |
97 | meet | 42 |
98 | given | 42 |
99 | laid | 42 |
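To see why the list comprehension is equivalent to the longer for loop used for adjectives and nouns, it can help to run both side by side. The sketch below does this on a handful of made-up (word, POS label) pairs, which stand in for spaCy tokens.

```python
# Hypothetical (word, POS label) pairs standing in for spaCy tokens
tagged_words = [
    ("I", "PRON"),
    ("saw", "VERB"),
    ("the", "DET"),
    ("wolves", "NOUN"),
    ("run", "VERB"),
]

# The for-loop version: start with an empty list and append matches one by one
verbs_loop = []
for word, label in tagged_words:
    if label == "VERB":
        verbs_loop.append(word)

# The list comprehension version: the same loop and condition in a single line
verbs_comprehension = [word for word, label in tagged_words if label == "VERB"]

print(verbs_loop)
print(verbs_loop == verbs_comprehension)
```

Both versions visit the words in the same order and apply the same test, so they build the same list.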
Keyword Extraction#
Get Sentences with Keyword#
spaCy can also identify sentences in a document. To access sentences, we can iterate through document.sents and pull out the .text of each sentence.
We can use spaCy’s sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below.
With the function find_sentences_with_keyword(), we will iterate through document.sents and pull out any sentence that contains a particular “keyword.” Then we will display these sentences with the keywords bolded.
import re
from IPython.display import Markdown, display

def find_sentences_with_keyword(keyword, document):
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            #re.escape() keeps any special regex characters in the keyword from being read as a pattern
            sentence = re.sub(re.escape(keyword), f"**{keyword}**", sentence, flags=re.IGNORECASE)
            display(Markdown(sentence))

find_sentences_with_keyword(keyword="telegram", document=document)
“ _telegram from Arthur Holmwood to Quincey P. Morris.
“ _telegram, Arthur Holmwood to Seward.
You must send to me the telegram every day; and if there be cause I shall come again.
telegram, Seward, London, to Van Helsing, Amsterdam._
“ telegram, Seward, London, to Van Helsing, Amsterdam.
“ telegram, Seward, London, to Van Helsing, Amsterdam. “6 September.–Terrible change for the worse.
I hold over telegram to Holmwood till have seen you.
“I waited till I had seen you, as I said in my telegram.
A telegram came from Van Helsing at Amsterdam whilst I was at dinner, suggesting that I should be at Hillingham to-night, as it might be well to be at hand, and stating that he was leaving by the night mail and would join me early in the morning.
telegram, Van Helsing, Antwerp, to Seward, Carfax._ (Sent to Carfax, Sussex, as no county given; delivered late by twenty-two hours.)
The arrival of Van Helsing’s telegram filled me with dismay.
Did you not get my telegram?” I answered as quickly and coherently as I could that I had only got his telegram early in the morning, and had not lost a minute in coming here, and that I could not make any one in the house hear me.
“ He handed me a telegram:– “Have not heard from Seward for three days, and am terribly anxious. Cannot leave.
“ In the hall I met Quincey Morris, with a telegram for Arthur telling him that Mrs. Westenra was dead; that Lucy also had been ill, but was now going on better; and that Van Helsing and I were with her.
* _Later._--A sad home-coming in every way--the house empty of the dear soul who was so good to us; Jonathan still pale and dizzy under a slight relapse of his malady; and now a **telegram** from Van Helsing, whoever he may be:-- "You will be grieved to hear that Mrs. Westenra died five days ago, and that Lucy died the day before yesterday.
“ _telegram, Mrs. Harker to Van Helsing.
When we arrived at the Berkeley Hotel, Van Helsing found a telegram waiting for him:– “Am coming up by train.
I have sent a telegram to Jonathan to come on here when he arrives in London from Whitby.
“ About half an hour after we had received Mrs. Harker’s telegram, there came a quiet, resolute knock at the hall door.
“Nota bene, in Madam’s telegram he went south from Carfax, that means he went to cross the river, and he could only do so at slack of tide, which should be something before one o’clock.
Lord Godalming went to the Consulate to see if any telegram had arrived for him, whilst the rest of us came on to this hotel–“the Odessus.”
He had four telegrams, one each day since we started, and all to the same effect: that the Czarina Catherine had not been reported to Lloyd’s from anywhere.
He had arranged before leaving London that his agent should send him every day a telegram saying if the ship had been reported.
Daily telegrams to Godalming, but only the same story: “Not yet reported.”
telegram, October 24th.
We were all wild with excitement yesterday when Godalming got his telegram from Lloyd’s.
The telegrams from London have been the same: “no further report.”
* _28 October._--**telegram**.
“ Dr. Seward’s Diary. 28 October.–When the telegram came announcing the arrival in Galatz I do not think it was such a shock to any of us as might have been expected.
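The results above include “telegrams” because the function checks whether the keyword appears anywhere in the sentence, and every match is rewritten in the keyword's own lowercase form. If whole-word matches that keep their original capitalization are preferred, one possible variation is sketched below. bold_whole_word is a hypothetical helper that works on plain strings rather than on a spaCy document, and the two sample sentences are made up.

```python
import re

def bold_whole_word(keyword, sentence):
    """Return the sentence with whole-word matches bolded, or None if there are none."""
    # \b marks a word boundary, so "telegram" will not match inside "telegrams"
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        return None
    # match.group(0) is the text exactly as it appears, so capitalization is preserved
    return pattern.sub(lambda match: f"**{match.group(0)}**", sentence)

print(bold_whole_word("telegram", "A Telegram came at dinner."))
print(bold_whole_word("telegram", "He had four telegrams."))
```

The first call should print the sentence with **Telegram** bolded, and the second should print None. Whether plurals ought to count as matches depends on the research question, so the substring approach in the lesson may still be the better fit.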
Get Keyword in Context#
We can also find out about a keyword’s more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.
To do so, we will first create a list of what’s called ngrams. An “ngram” is any sequence of n tokens in a text. Ngrams are an important concept in computational linguistics and NLP. (Have you ever played with Google’s Ngram Viewer?)
Below we’re going to make a list of bigrams, that is, all the two-word combinations from Dracula. We’re going to use these bigrams to find the neighboring words that appear alongside particular keywords.
#Make a list of tokens and POS labels from document if the token is a word
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
    return ngrams

bigrams = get_bigrams(tokens_and_labels)
Let’s take a peek at the bigrams:
bigrams[5:20]
[[('by', 'ADP'), ('Bram', 'PROPN')],
[('Bram', 'PROPN'), ('Stoker', 'PROPN')],
[('Stoker', 'PROPN'), ('This', 'DET')],
[('This', 'DET'), ('eBook', 'PROPN')],
[('eBook', 'PROPN'), ('is', 'AUX')],
[('is', 'AUX'), ('for', 'ADP')],
[('for', 'ADP'), ('the', 'DET')],
[('the', 'DET'), ('use', 'NOUN')],
[('use', 'NOUN'), ('of', 'ADP')],
[('of', 'ADP'), ('anyone', 'PRON')],
[('anyone', 'PRON'), ('anywhere', 'ADV')],
[('anywhere', 'ADV'), ('at', 'ADP')],
[('at', 'ADP'), ('no', 'DET')],
[('no', 'DET'), ('cost', 'NOUN')],
[('cost', 'NOUN'), ('and', 'CCONJ')]]
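The slicing logic in get_bigrams() is not limited to pairs. The sketch below re-implements the same idea in a compact, self-contained form to show how it extends to three-word sequences (trigrams). get_ngrams is a hypothetical name, and the word list is made up for illustration.

```python
def get_ngrams(word_list, n=2):
    """Return every run of n consecutive items in word_list."""
    # The last valid starting index is len(word_list) - n, hence the "+ 1" in range()
    return [word_list[i : i + n] for i in range(len(word_list) - n + 1)]

words = ["the", "night", "was", "dark"]

print(get_ngrams(words, n=2))
print(get_ngrams(words, n=3))
```

The lesson's own function already accepts a number_consecutive_words parameter, so calling get_bigrams(tokens_and_labels, number_consecutive_words=3) should produce trigrams in the same way.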
Now that we have our list of bigrams, we’re going to make a function get_neighbor_words(). This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the pos_label parameter.
def get_neighbor_words(keyword, bigrams, pos_label = None):
    neighbor_words = []
    keyword = keyword.lower()
    for bigram in bigrams:
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]
        #Check to see if keyword is in the bigram
        if keyword in words:
            for word, label in bigram:
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()
get_neighbor_words("telegram", bigrams)
[('a', 6),
('from', 3),
('seward', 3),
('arthur', 2),
('the', 2),
('to', 2),
('my', 2),
('i', 2),
('came', 2),
('helsing', 2),
('his', 2),
('harker', 2),
('morris', 1),
('every', 1),
('see', 1),
('day', 1),
('back', 1),
('over', 1),
('it', 1),
('van', 1),
('filled', 1),
('early', 1),
('for', 1),
('waiting', 1),
('there', 1),
('madam', 1),
('he', 1),
('any', 1),
('had', 1),
('saying', 1),
('masts', 1),
('october', 1)]
get_neighbor_words("telegram", bigrams, pos_label='VERB')
[('came', 2), ('see', 1), ('filled', 1), ('waiting', 1), ('saying', 1)]
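get_neighbor_words() pools the words on both sides of the keyword, so the results do not show whether “came” precedes or follows “telegram.” If direction matters, the position of the keyword within each bigram can be checked. Here is a minimal sketch under that assumption: get_left_and_right_neighbors is a hypothetical name, and sample_bigrams is a hand-made list in the same format as the bigrams above.

```python
from collections import Counter

def get_left_and_right_neighbors(keyword, bigrams):
    """Tally words that come directly before and directly after the keyword."""
    keyword = keyword.lower()
    left_neighbors, right_neighbors = Counter(), Counter()
    for (first_word, _), (second_word, _) in bigrams:
        # Keyword in second position: the first word is its left neighbor
        if second_word.lower() == keyword:
            left_neighbors[first_word.lower()] += 1
        # Keyword in first position: the second word is its right neighbor
        if first_word.lower() == keyword:
            right_neighbors[second_word.lower()] += 1
    return left_neighbors, right_neighbors

# Made-up bigrams in the same [(word, label), (word, label)] format
sample_bigrams = [
    [("a", "DET"), ("telegram", "NOUN")],
    [("telegram", "NOUN"), ("came", "VERB")],
    [("my", "PRON"), ("telegram", "NOUN")],
    [("a", "DET"), ("telegram", "NOUN")],
]

left, right = get_left_and_right_neighbors("telegram", sample_bigrams)
print(left.most_common())
print(right.most_common())
```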
Your Turn!#
Try out find_sentences_with_keyword() and get_neighbor_words() with your own keywords of interest.
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)