Part-of-Speech Tagging#

In this lesson, we’re going to learn about the textual analysis methods part-of-speech tagging and keyword extraction. These methods will help us computationally parse sentences and better understand words in context.

We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for other languages.

[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

-Claire Cain Miller, “Ada Lovelace,” NYT Overlooked Obituaries

Why is Part-of-Speech Tagging Useful?#

I don’t mean to go all Language Nerd on you, but parts of speech are important. Even if they seem kind of boring. Parts of speech are the grammatical units of language — such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.

By computationally identifying parts of speech, we can start computationally exploring syntax (the relationship between words) rather than only focusing on words in isolation, as we did with tf-idf. Though parts of speech may seem pedantic, they help computers (and us) take a crack at that ever-elusive abstract noun: meaning.

spaCy and Natural Language Processing (NLP)#

To computationally identify parts of speech, we’re going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more.

Install spaCy#

To use spaCy, we first need to install the library.

!pip install -U spacy

Import Libraries#

Then we’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 400
pd.options.display.max_colwidth = 400

We’re also going to import the Counter class from the collections module for counting nouns, verbs, adjectives, etc., and the pandas library for organizing and displaying data (we’re also changing the pandas default display settings for the maximum number of rows and the maximum column width).
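As a quick illustration of what Counter does, here is a minimal sketch that tallies a short, made-up list of part-of-speech labels. The list `tags` and its contents are assumptions for demonstration only, not output from spaCy.

```python
from collections import Counter

# Hypothetical list of POS labels, invented for this illustration
tags = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]

# Counter maps each distinct item to the number of times it appears
pos_tally = Counter(tags)

# most_common(n) returns the n most frequent items as (item, count) pairs
print(pos_tally.most_common(2))
```

Later in the lesson we use the same pattern on lists of words pulled out of a processed spaCy document.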

Download Language Model#

Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

!python -m spacy download en_core_web_sm

Requirement already satisfied: en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0 in /Users/melaniewalsh/anaconda3/lib/python3.7/site-packages (2.1.0)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.

spaCy also offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.

Load Language Model#

Once the model is downloaded, we need to load it with spacy.load() and assign it to the variable nlp.

nlp = spacy.load('en_core_web_sm')

Create a Processed spaCy Document#

Whenever we use spaCy, our first step will be to create a processed spaCy document with the loaded NLP model nlp(). Most of the heavy NLP lifting is done in this line of code. After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

To test out spaCy’s part-of-speech tagging, we’ll begin by processing a sample sentence from Ada Lovelace’s obituary:

[Charles] Babbage, who called [Ada Lovelace] the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

This sentence makes for an interesting example because it is syntactically complex and because it contains ambiguous words such as “spell,” “abstract,” and “force.”

sample = """She “has thrown her magical spell around the most abstract of Sciences."""

document = nlp(sample)

spaCy Part-of-Speech Tagging#

POS	Description	Examples
ADJ	adjective	big, old, green, incomprehensible, first
ADP	adposition	in, to, during
ADV	adverb	very, tomorrow, down, where, there
AUX	auxiliary	is, has (done), will (do), should (do)
CONJ	conjunction	and, or, but
CCONJ	coordinating conjunction	and, or, but
DET	determiner	a, an, the
INTJ	interjection	psst, ouch, bravo, hello
NOUN	noun	girl, cat, tree, air, beauty
NUM	numeral	1, 2017, one, seventy-seven, IV, MMXIV
PART	particle	’s, not
PRON	pronoun	I, you, he, she, myself, themselves, somebody
PROPN	proper noun	Mary, John, London, NATO, HBO
PUNCT	punctuation	., (, ), ?
SCONJ	subordinating conjunction	if, while, that
SYM	symbol	$, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB	verb	run, runs, running, eat, ate, eating
X	other	sfpksdpsxmsa
SPACE	space

Above is a POS chart taken from spaCy’s website, which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy’s POS tagging in action, we can use the spaCy module displacy on our sample document with the style= parameter set to “dep” (short for dependency parsing):

#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(document, style="dep", options=options)

As you can see, spaCy has correctly identified that “spell” is a noun in our shortened sample sentence:

for token in document:
    if token.pos_ == "NOUN":
        print(token, token.pos_)

spell NOUN

But if we look at the same words in a different context — in a sentence that I made up — spaCy can identify when these words have changed grammatical roles and meanings.

You shouldn’t force someone to learn how to spell Babbage. They just need practice. You can’t abstract it.

document = nlp("You shouldn't force someone to learn how to spell Babbage. They just need practice. You can't abstract it.")

for token in document:
    if token.pos_ == "VERB":
        print(token, token.pos_)

force VERB
learn VERB
spell VERB
need VERB
abstract VERB

Where previously spaCy had identified “spell” as a noun, here spaCy correctly identifies the words “force,” “spell,” and “abstract” as verbs.

Get Part-Of-Speech Tags#

To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the .pos_ attribute for each token. We can also get each token’s syntactic dependency label, which describes its grammatical relationship to other words in the sentence, with the attribute .dep_.

for token in document:
    print(token.text, token.pos_, token.dep_)

You PRON nsubj
should AUX aux
n't PART neg
force VERB ROOT
someone PRON dobj
to PART aux
learn VERB xcomp
how SCONJ advmod
to PART aux
spell VERB xcomp
Babbage NOUN dobj
. PUNCT punct
They PRON nsubj
just ADV advmod
need VERB ROOT
practice NOUN dobj
. PUNCT punct
You PRON nsubj
ca AUX aux
n't PART neg
abstract VERB ROOT
it PRON dobj
. PUNCT punct
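Once we have (text, POS, dependency) triples like the ones printed above, it can help to group the words by their part of speech. The sketch below does this with plain Python. The `rows` list is copied by hand from the first sentence of the output above rather than produced by spaCy, and the variable names are assumptions made for illustration.

```python
from collections import defaultdict

# (text, pos, dep) rows copied from the first sentence of the output above
rows = [
    ("You", "PRON", "nsubj"),
    ("should", "AUX", "aux"),
    ("n't", "PART", "neg"),
    ("force", "VERB", "ROOT"),
    ("someone", "PRON", "dobj"),
    ("to", "PART", "aux"),
    ("learn", "VERB", "xcomp"),
    ("how", "SCONJ", "advmod"),
    ("to", "PART", "aux"),
    ("spell", "VERB", "xcomp"),
    ("Babbage", "NOUN", "dobj"),
]

# Collect the words under each POS label, preserving sentence order
words_by_pos = defaultdict(list)
for text, pos, dep in rows:
    words_by_pos[pos].append(text)

for pos, words in words_by_pos.items():
    print(pos, words)
```

With a real spaCy document, the same loop could read token.text, token.pos_, and token.dep_ directly instead of unpacking hand-copied tuples.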

Practicing with Dracula#

filepath = "../texts/literature/Dracula_Bram-Stoker.txt"
# Read the novel, then process the full text with spaCy
with open(filepath, encoding="utf-8") as file:
    document = nlp(file.read())

Get Adjectives#

POS	Description	Examples
ADJ	adjective	big, old, green, incomprehensible, first

To extract and count the adjectives in Dracula, we will follow the same model as above, except we’ll add an if statement that will pull out words only if their POS label matches “ADJ.”

Python Review

While we demonstrate how to extract parts of speech in the sections below, we’re also going to reinforce some integral Python skills. Notice how we use for loops and if statements to .append() specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
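The pattern described above (loop, filter with an if statement, .append(), count, build a dataframe) can be sketched on a tiny example. To keep the sketch self-contained, it uses a hypothetical `Token` namedtuple as a stand-in for spaCy tokens, with invented words and labels. The column names "adjective" and "count" are also assumptions.

```python
from collections import Counter, namedtuple
import pandas as pd

# Hypothetical stand-in for spaCy tokens: only the two attributes used here
Token = namedtuple("Token", ["text", "pos_"])
tokens = [
    Token("old", "ADJ"), Token("castle", "NOUN"), Token("dark", "ADJ"),
    Token("stood", "VERB"), Token("old", "ADJ"),
]

# for loop + if statement + .append() to collect only the adjectives
adjectives = []
for token in tokens:
    if token.pos_ == "ADJ":
        adjectives.append(token.text)

# Count the words, then turn the (word, count) pairs into a dataframe
tally = Counter(adjectives)
adjective_df = pd.DataFrame(tally.most_common(), columns=["adjective", "count"])
print(adjective_df)
```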

Here we make a list of the adjectives identified in Dracula:

adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)

adjs

['*',
 'available',
 'next',
 'wonderful',
 'little',
 'correct',
 'possible',
 'western',
 'splendid',
 'noble',
 'Turkish',
 'good',
 'red',
 'good',
 'thirsty',
 'national',
 'able',
 'German',
 'useful',
 'able',
 'extreme',
 'Carpathian',
 'wildest',
 'least',
 'able',
 'exact',
 'own',
 'distinct',
 'latter',
 'eleventh',
 'known',
 'imaginative',
 'interesting',
 'comfortable',
 'thirsty',
 'continuous',
 'more',
 'mamaliga',
 'excellent',
 'impletata',
 'little',
 'more',
 'further',
 'unpunctual',
 'full',
 'little',
 'steep',
 'such',
 'old',
 'wide',
 'subject',
 'great',
 'strong',
 'outside',
 'clear',
 'short',
 'round',
 'picturesque',
 'pretty',
 'clumsy',
 'full',
 'white',
 'other',
 'most',
 'big',
 'strangest',
 'barbarian',
 'big',
 'great',
 'baggy',
 'dirty',
 'white',
 'white',
 'enormous',
 'heavy',
 'high',
 'long',
 'black',
 'heavy',
 'black',
 'picturesque',
 'old',
 'Oriental',
 'harmless',
 'natural',
 'dark',
 'interesting',
 'old',
 'stormy',
 'great',
 'terrible',
 'separate',
 'very',
 'seventeenth',
 'proper',
 'great',
 'old',
 'fashioned',
 'elderly',
 'usual',
 'white',
 'long',
 'double',
 'coloured',
 'tight',
 'elderly',
 'white',
 'happy',
 'beautiful',
 'best',
 'reticent',
 'true',
 'least',
 'old',
 'other',
 'frightened',
 'mysterious',
 'old',
 'hysterical',
 'young',
 'excited',
 'German',
 'other',
 'able',
 'many',
 'important',
 'fourth',
 'evil',
 'full',
 'such',
 'evident',
 'least',
 'ridiculous',
 'comfortable',
 'imperative',
 'such',
 'idolatrous',
 'ungracious',
 'old',
 'late',
 'old',
 'many',
 'ghostly',
 'easy',
 'usual',
 'high',
 'distant',
 'jagged',
 'big',
 'little',
 'mixed',
 'sleepy',
 'awake',
 'many',
 'odd',
 'red',
 'simple',
 'disagreeable',
 'most',
 'queer',
 'many',
 'same',
 'other',
 'considerable',
 'evil',
 'pleasant',
 'unknown',
 'unknown',
 'hearted',
 'sorrowful',
 'sympathetic',
 'last',
 'picturesque',
 'wide',
 'rich',
 'green',
 'wide',
 'whole',
 'big',
 'small',
 'ghostly',
 'able',
 'green',
 'sloping',
 'full',
 'steep',
 'blank',
 'gable',
 'bewildering',
 'green',
 'green',
 'pine',
 'rugged',
 'feverish',
 'bent',
 'excellent',
 'different',
 'general',
 'old',
 'good',
 'old',
 'foreign',
 'green',
 'mighty',
 'lofty',
 'full',
 'glorious',
 'beautiful',
 'deep',
 'purple',
 'green',
 'brown',
 'endless',
 'jagged',
 'snowy',
 'mighty',
 'white',
 'lofty',
 'serpentine',
 'right',
 'endless',
 'lower',
 'snowy',
 'delicate',
 'cool',
 'picturesque',
 'prevalent',
 'many',
 'peasant',
 'outer',
 'many',
 'new',
 'beautiful',
 'white',
 'delicate',
 'ordinary',
 'long',
 'like',
 'sure',
 'white',
 'coloured',
 'latter',
 'long',
 'cold',
 'dark',
 'dark',
 'pine',
 'great',
 'weird',
 'solemn',
 'grim',
 'strange',
 'like',
 'steep',
 'fierce',
 'grim',
 'enough',
 'such',
 'only',
 'dark',
 'other',
 'further',
 'long',
 'wild',
 'further',
 'greater',
 'crazy',
 'great',
 'stormy',
 'more',
 'several',
 'odd',
 'varied',
 'simple',
 'good',
 'kindly',
 'strange',
 'evil',
 'evident',
 'exciting',
 'slightest',
 'little',
 'last',
 'eastern',
 'dark',
 'rolling',
 'heavy',
 'oppressive',
 'thunderous',
 'dark',
 'only',
 'own',
 'white',
 'sandy',
 'white',
 'own',
 'low',
 'less',
 'German',
 'worse',
 'own:--',
 'next',
 'better',
 'next',
 'universal',
 'black',
 'splendid',
 'tall',
 'long',
 'brown',
 'great',
 'black',
 'bright',
 'red',
 'early',
 'stranger',
 'swift',
 'red',
 'sharp',
 'white',
 'dead',
 'strange',
 'same',
 'close',
 'prodigious',
 'late',
 'strange',
 'lonely',
 'excellent',
 'same',
 'little',
 'little',
 'frightened',
 'unknown',
 'hard',
 'complete',
 'straight',
 'same',
 'curious',
 'few',
 'general',
 'recent',
 'sick',
 'long',
 'agonised',
 'wild',
 'first',
 'rear',
 'sudden',
 'fright',
 'sharper',
 'same',
 'minded',
 'great',
 'few',
 'own',
 'accustomed',
 'quiet',
 'able',
 'extraordinary',
 'manageable',
 'great',
 'far',
 'narrow',
 'great',
 'colder',
 'colder',
 'fine',
 'white',
 'keen',
 'nearer',
 'afraid',
 'right',
 'faint',
 'blue',
 'same',
 'less',
 'asleep',
 'awful',
 'blue',
 'faint',
 'few',
 'strange',
 'optical',
 'same',
 'momentary',
 'blue',
 'worse',
 'black',
 'jagged',
 'beetling',
 'white',
 'red',
 'long',
 'shaggy',
 'more',
 'terrible',
 'grim',
 'such',
 'true',
 'peculiar',
 'painful',
 'only',
 'imperious',
 'long',
 'impalpable',
 'heavy',
 'strange',
 'uncanny',
 'dreadful',
 'afraid',
 'interminable',
 'complete',
 'occasional',
 'quick',
 'main',
 'conscious',
 'vast',
 'tall',
 'black',
 'broken',
 'jagged',
 'moonlit',
 'asleep',
 'awake',
 'remarkable',
 'considerable',
 'several',
 'dark',
 'great',
 'round',
 'bigger',
 'able',
 'prodigious',
 'great',
 'old',
 'large',
 'massive',
 'dim',
 'dark',
 'dark',
 'likely',
 'endless',
 'grim',
 'customary',
 'successful',
 'awake',
 'horrible',
 'awake',
 'awake',
 'patient',
 'heavy',
 'great',
 'massive',
 'loud',
 'long',
 'great',
 'tall',
 'old',
 'clean',
 'shaven',
 'long',
 'white',
 'single',
 'antique',
 'long',
 'open',
 'old',
 'right',
 'courtly',
 'excellent',
 'strange',
 'Welcome',
 'own',
 'cold',
 'dead',
 'Welcome',
 'akin',
 'same',
 'sure',
 'courtly',
 'late',
 'available',
 'great',
 'great',
 'open',
 'heavy',
 'mighty',
 'great',
 'small',
 'octagonal',
 'single',
 'welcome',
 'great',
 'top',
 'fresh',
 'hollow',
 'wide',
 'door:--',
 'ready',
 'other',
 'prepared',
 'courteous',
 'normal',
 'hasty',
 'other',
 'great',
 'graceful',
 'charming',
 'least',
 'constant',
 'happy',
 'sufficient',
 'possible',
 'young',
 'full',
 'own',
 'faithful',
 'discreet',
 'silent',
 'ready',
 'excellent',
 'old',
 'many',
 'same',
 'strong',
 'strong',
 'high',
 'thin',
 'arched',
 'lofty',
 'domed',
 'massive',
 'bushy',
 'own',
 'heavy',
 'cruel',
 'looking',
 'sharp',
 'white',
 'remarkable',
 'astonishing',
 'pale',
 'pointed',
 'broad',
 'strong',
 'general',
 'extraordinary',
 'white',
 'fine',
 'close',
 'coarse',
 'broad',
 'squat',
 'Strange',
 'long',
 'fine',
 'sharp',
 'rank',
 'horrible',
 'grim',
 'more',
 'protuberant',
 'own',
 'silent',
 'first',
 'dim',
 'strange',
 'many',
 'strange',
 'tired',
 'ready',
 'courteous',
 'octagonal',
 'strange',
 'own',
 'dear',
 'early',
 'last',
 'own',
 'cold',
 'hot',
 'absent',
 'hearty',
 'odd',
 'extraordinary',
 'round',
 'immense',
 'costliest',
 'beautiful',
 'fabulous',
 'old',
 'excellent',
 'little',
 'opposite',
 'great',
 'vast',
 'English',
 'whole',
 'full',
 'English',
 'recent',
 'varied',
 'political',
 'English',
 'such',
 'Blue',
 'hearty',
 'good',
 'glad',
 'sure',
 'much',
 'good',
 'past',
 'many',
 'many',
 'great',
 'crowded',
 'mighty',
 'whirl',
 'flattering',
 'little',
 'True',
 'enough',
 'noble',
 'common',
 'strange',
 'content',
 'long',
 'least',
 'other',
 'alone',
 'new',
 'English',
 'smallest',
 'sorry',
 'many',
 'important',
 'willing',
 'sure',
 'many',
 'strange',
 'strange',
 'much',
 'evident',
 'many',
 'strange',
 'blue',
 'certain',
 'last',
 'evil',
 'unchecked',
 'blue',
 'last',
 'little',
 'old',
 'aged',
 'artificial',
 'triumphant',
 'little',
 'friendly',
 'undiscovered',
 'sure',
 'long',
 'sharp',
 'dear',
 'own',
 'able',
 'right',
 'more',
 'dead',
 'other',
 'last',
 'own',
 'next',
 'interested',
 'myriad',
 'more',
 'needful',
 'alone',
 'other',
 'necessary',
 'ready',
 'suitable',
 'high',
 'ancient',
 'heavy',
 'large',
 'closed',
 'heavy',
 'old',
 'old',
 'sided',
 'cardinal',
 'solid',
 'many',
 'gloomy',
 'deep',
 'dark',
 'small',
 'clear',
 'fair',
 'sized',
 'large',
 'mediæval',
 'thick',
 'few',
 'high',
 'close',
 'old',
 'various',
 'straggling',
 'great',
 'few',
 'close',
 'large',
 'private',
 'lunatic',
 'visible',
 'glad',
 'old',
 'big',
 'old',
 'new',
 'habitable',
 'few',
 'old',
 'common',
 'dead',
 'bright',
 'much',
 'sparkling',
 'young',
 'gay',
 'young',
 'weary',
 'dead',
 'attuned',
 'many',
 'cold',
 'broken',
 'alone',
 'malignant',
 'saturnine',
 'little',
 'certain',
 'little',
 'new',
 'other',
 'better',
 'ready',
 'next',
 'excellent',
 'ready',
 'previous',
 'last',
 'conceivable',
 'late',
 'sleepy',
 'long',
 'tired',
 'preternatural',
 'clear',
 'remiss',
 'long',
 'dear',
 'new',
 'interesting',
 'own',
 'little',
 'warm',
 'glad',
 'first',
 'strange',
 'uneasy',
 'safe',
 'strange',
 'only',
 'prosaic',
 'few',
 'Good',
 'whole',
 'mistaken',
 'close',
 'whole',
 'startling',
 'many',
 'strange',
 'vague',
 'near',
 'little',
 'half',
 'instant',
 'dangerous',
 'wretched',
 'bauble',
 'heavy',
 'terrible',
 'annoying',
 'strange',
 'peculiar',
 'little',
 'magnificent',
 'very',
 'terrible',
 'green',
 'deep',
 'silver',
 'deep',
 'locked',
 'castle',
 'available',
 'veritable',
 'wild',
 'little',
 'other',
 'few',
 'mad',
 'helpless',
 'best',
 'definite',
 'certain',
 'own',
 'only',
 'open',
 'own',
 'desperate',
 'latter',
 'great',
 'own',
 'odd',
 'menial',
 'fright',
 'terrible',
 'terrible',
 'wild',
 'good',
 'good',
 'odd',
 'idolatrous',
 'tangible',
 'careful',
 'long',
 'few',
 'present',
 'own',
 'fascinating',
 'whole',
 'excited',
 'great',
 'white',
 'main',
 'race:--',
 'proud',
 'many',
 'brave',
 'European',
 'Ugric',
 'such',
 'fell',
 'warlike',
 'old',
 'great',
 'proud',
 'strange',
 'Hungarian',
 'Hungarian',
 'victorious',
 'more',
 'endless',
 'sleepless',
 'bloody',
 'warlike',
 'great',
 'own',
 'own',
 'own',
 'unworthy',
 'other',
 'later',
 'great',
 'bloody',
 'good',
 'Hungarian',
 'free',
 'young',
 'warlike',
 'precious',
 'dishonourable',
 'great',
 'bare',
 'meagre',
 'own',
 'Last',
 'legal',
 'certain',
 'certain',
 'useful',
 'more',
 'wise',
 'more',
 'certain',
 'practical',
 'local',
 'beautiful',
 'good',
 'Good',
 'strange',
 'local',
 'much',
 'more',
 'easy',
 'other',
 'local',
 'further',
 'such',
 'Good',
 'best',
 'wonderful',
 'much',
 'wonderful',
 'available',
 'first',
 'other',
 'young',
 'heavy',
 'other',
 'cold',
 'own',
 'smooth',
 'good',
 'young',
 'other',
 'thinnest',
 'foreign',
 'quiet',
 'sharp',
 'red',
 'careful',
 'able',
 'formal',
 'quiet',
 'several',
 'own',
 'third',
 'fourth',
 'second',
 'fourth',
 'about',
 'much',
 'private',
 'dear',
 'young',
 'other',
 'old',
 'many',
 'bad',
 'own',
 'safe',
 'careful',
 'gruesome',
 'only',
 'terrible',
 'unnatural',
 'horrible',
 'last',
 'freer',
 'little',
 'vast',
 'inaccessible',
 'narrow',
 'fresh',
 'nocturnal',
 'own',
 'full',
 'horrible',
 'terrible',
 'beautiful',
 'soft',
 'yellow',
 'light',
 'soft',
 'distant',
 'mere',
 'own',
 'tall',
 'deep',
 'complete',
 'many',
 'many',
 'interested',
 'wonderful',
 'small',
 'very',
 'whole',
 'castle',
 'dreadful',
 'great',
 'weird',
 'clear',
 ...]
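Notice that the list above contains capitalized variants (“Welcome” and “welcome,” “Good” and “good”) as well as strings with stray punctuation (“own:--,” “*”). If you would rather treat these as the same word, one optional cleaning step is to lowercase each string and trim punctuation before counting. The sketch below is one possible approach, not part of the lesson's own pipeline. The helper `normalize` is a hypothetical name, and the sample values are copied by hand from the list above.

```python
from collections import Counter
import string

# A few values copied from the adjective list above
raw_adjs = ["Welcome", "welcome", "Good", "good", "own:--", "own", "*"]

def normalize(word):
    """Lowercase a word and trim punctuation from both ends."""
    return word.lower().strip(string.punctuation)

cleaned = [normalize(word) for word in raw_adjs]
# Drop strings that consisted only of punctuation, such as "*"
cleaned = [word for word in cleaned if word]

print(Counter(cleaned).most_common())
```

Whether to merge “Good” and “good” is an analytical choice: capitalization can carry meaning (for example, at the start of dialogue), so the counts below are computed on the raw strings.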

Then we count how many times each unique adjective appears in this list with the Counter class:

adjs_tally = Counter(adjs)

adjs_tally.most_common()

[('good', 198),
 ('old', 188),
 ('other', 185),
 ('own', 184),
 ('more', 178),
 ('great', 173),
 ('poor', 171),
 ('little', 164),
 ('dear', 151),
 ('much', 148),
 ('such', 129),
 ('last', 115),
 ('same', 110),
 ('white', 103),
 ('many', 100),
 ('terrible', 99),
 ('full', 97),
 ('long', 90),
 ('few', 86),
 ('strange', 85),
 ('first', 78),
 ('new', 73),
 ('ready', 71),
 ('dead', 69),
 ('red', 67),
 ('whole', 66),
 ('open', 66),
 ('sweet', 65),
 ('dark', 60),
 ('strong', 59),
 ('very', 57),
 ('true', 54),
 ('heavy', 53),
 ('young', 53),
 ('quick', 48),
 ('able', 47),
 ('happy', 47),
 ('right', 47),
 ('asleep', 47),
 ('big', 44),
 ('small', 43),
 ('sure', 43),
 ('better', 43),
 ('best', 41),
 ('cold', 41),
 ('wild', 41),
 ('close', 41),
 ('free', 41),
 ('late', 40),
 ('certain', 40),
 ('present', 40),
 ('afraid', 39),
 ('high', 38),
 ('quiet', 37),
 ('pale', 36),
 ('silent', 35),
 ('glad', 35),
 ('usual', 33),
 ('sad', 33),
 ('possible', 32),
 ('bad', 32),
 ('least', 31),
 ('beautiful', 31),
 ('low', 31),
 ('awful', 31),
 ('thin', 31),
 ('hard', 30),
 ('brave', 30),
 ('alone', 29),
 ('mad', 29),
 ('next', 28),
 ('deep', 28),
 ('anxious', 28),
 ('wonderful', 27),
 ('empty', 27),
 ('electronic', 27),
 ('black', 26),
 ('sharp', 26),
 ('half', 26),
 ('awake', 25),
 ('sudden', 25),
 ('horrible', 25),
 ('necessary', 25),
 ('fair', 25),
 ('safe', 25),
 ('Good', 25),
 ('grim', 24),
 ('bright', 24),
 ('fresh', 24),
 ('tired', 24),
 ('wide', 23),
 ('different', 23),
 ('only', 23),
 ('common', 22),
 ('satisfied', 22),
 ('noble', 21),
 ('short', 21),
 ('enough', 21),
 ('dreadful', 21),
 ('bitter', 21),
 ('weak', 21),
 ('odd', 20),
 ('Poor', 20),
 ('well', 19),
 ('round', 18),
 ('most', 18),
 ('evident', 18),
 ('worse', 18),
 ('ill', 18),
 ('real', 18),
 ('Dead', 18),
 ('simple', 17),
 ('less', 17),
 ('fine', 17),
 ('careful', 17),
 ('earnest', 17),
 ('wrong', 17),
 ('evil', 16),
 ('several', 16),
 ('past', 16),
 ('clever', 16),
 ('hypnotic', 16),
 ('latter', 15),
 ('sleepy', 15),
 ('fierce', 15),
 ('greater', 15),
 ('complete', 15),
 ('second', 15),
 ('angry', 15),
 ('cheerful', 15),
 ('blind', 15),
 ('excellent', 14),
 ('general', 14),
 ('tall', 14),
 ('Last', 14),
 ('soft', 14),
 ('sacred', 14),
 ('worth', 14),
 ('nice', 14),
 ('easy', 13),
 ('green', 13),
 ('blue', 13),
 ('large', 13),
 ('various', 13),
 ('foul', 13),
 ('calm', 13),
 ('fearful', 13),
 ('dearest', 13),
 ('surprised', 13),
 ('alive', 13),
 ('further', 12),
 ('clear', 12),
 ('important', 12),
 ('considerable', 12),
 ('early', 12),
 ('broken', 12),
 ('thick', 12),
 ('impossible', 12),
 ('sane', 12),
 ('nervous', 12),
 ('unhappy', 12),
 ('kind', 12),
 ('personal', 12),
 ('stern', 12),
 ('grateful', 12),
 ('useful', 11),
 ('unknown', 11),
 ('lower', 11),
 ('solemn', 11),
 ('faint', 11),
 ('conscious', 11),
 ('sufficient', 11),
 ('hot', 11),
 ('dangerous', 11),
 ('precious', 11),
 ('lovely', 11),
 ('human', 11),
 ('-', 11),
 ('violent', 11),
 ('grave', 11),
 ('selfish', 11),
 ('special', 11),
 ('public', 11),
 ('rough', 11),
 ('available', 10),
 ('interesting', 10),
 ('endless', 10),
 ('sick', 10),
 ('painful', 10),
 ('main', 10),
 ('broad', 10),
 ('willing', 10),
 ('uneasy', 10),
 ('near', 10),
 ('wise', 10),
 ('mere', 10),
 ('voluptuous', 10),
 ('deadly', 10),
 ('serious', 10),
 ('physical', 10),
 ('grey', 10),
 ('stronger', 10),
 ('tiny', 10),
 ('particular', 10),
 ('exact', 9),
 ('comfortable', 9),
 ('steep', 9),
 ('natural', 9),
 ('double', 9),
 ('mysterious', 9),
 ('excited', 9),
 ('distant', 9),
 ('foreign', 9),
 ('mighty', 9),
 ('lonely', 9),
 ('curious', 9),
 ('extraordinary', 9),
 ('narrow', 9),
 ('dim', 9),
 ('sorry', 9),
 ('previous', 9),
 ('desperate', 9),
 ('light', 9),
 ('unconscious', 9),
 ('worst', 9),
 ('difficult', 9),
 ('active', 9),
 ('tight', 8),
 ('ordinary', 8),
 ('like', 8),
 ('peculiar', 8),
 ('loud', 8),
 ('faithful', 8),
 ('warm', 8),
 ('vague', 8),
 ('helpless', 8),
 ('local', 8),
 ('former', 8),
 ('due', 8),
 ('similar', 8),
 ('resolute', 8),
 ('secret', 8),
 ('miserable', 8),
 ('funny', 8),
 ('absolute', 8),
 ('busy', 8),
 ('strait', 8),
 ('mortal', 8),
 ('spiritual', 8),
 ('eager', 8),
 ('left', 8),
 ('rare', 8),
 ('equal', 8),
 ('bent', 7),
 ('straight', 7),
 ('sharper', 7),
 ('far', 7),
 ('successful', 7),
 ('patient', 7),
 ('single', 7),
 ('courteous', 7),
 ('interested', 7),
 ('lunatic', 7),
 ('proud', 7),
 ('later', 7),
 ('about', 7),
 ('gentle', 7),
 ('More', 7),
 ('churchyard', 7),
 ('beloved', 7),
 ('regular', 7),
 ('accurate', 7),
 ('armed', 7),
 ('weaker', 7),
 ('rusty', 7),
 ('utmost', 7),
 ('mental', 7),
 ('infinite', 7),
 ('dry', 7),
 ('stertorous', 7),
 ('holy', 7),
 ('correct', 6),
 ('pretty', 6),
 ('frightened', 6),
 ('pleasant', 6),
 ('recent', 6),
 ('agonised', 6),
 ('fright', 6),
 ('hollow', 6),
 ('hearty', 6),
 ('English', 6),
 ('sized', 6),
 ('private', 6),
 ('visible', 6),
 ('weary', 6),
 ('bare', 6),
 ('legal', 6),
 ('yellow', 6),
 ('dusty', 6),
 ('upset', 6),
 ('fatal', 6),
 ('wooden', 6),
 ('diabolical', 6),
 ('intense', 6),
 ('favourite', 6),
 ('startled', 6),
 ('intellectual', 6),
 ('suspicious', 6),
 ('ghastly', 6),
 ('immediate', 6),
 ('picturesque', 5),
 ('proper', 5),
 ('hysterical', 5),
 ('ghostly', 5),
 ('jagged', 5),
 ('hearted', 5),
 ('blank', 5),
 ('cool', 5),
 ('swift', 5),
 ('accustomed', 5),
 ('vast', 5),
 ('massive', 5),
 ('likely', 5),
 ('clean', 5),
 ('prepared', 5),
 ('bushy', 5),
 ('cruel', 5),
 ('immense', 5),
 ('True', 5),
 ('startling', 5),
 ('instant', 5),
 ('silver', 5),
 ('castle', 5),
 ('bloody', 5),
 ('third', 5),
 ('brilliant', 5),
 ('golden', 5),
 ('cunning', 5),
 ('sensitive', 5),
 ('Same', 5),
 ('dizzy', 5),
 ('vain', 5),
 ('subtle', 5),
 ('despairing', 5),
 ('earthly', 5),
 ('homicidal', 5),
 ('vital', 5),
 ('concerned', 5),
 ('unusual', 5),
 ('restless', 5),
 ('additional', 5),
 ('loose', 5),
 ('humble', 5),
 ('younger', 5),
 ('pure', 5),
 ('apparent', 5),
 ('hush', 5),
 ('garlic', 5),
 ('pleased', 5),
 ('slow', 5),
 ('original', 5),
 ('individual', 5),
 ('thicker', 5),
 ('known', 4),
 ('dirty', 4),
 ('harmless', 4),
 ('fourth', 4),
 ('sorrowful', 4),
 ('snowy', 4),
 ('delicate', 4),
 ('weird', 4),
 ('keen', 4),
 ('nearer', 4),
 ('courtly', 4),
 ('normal', 4),
 ('charming', 4),
 ('constant', 4),
 ('pointed', 4),
 ('whirl', 4),
 ('content', 4),
 ('ancient', 4),
 ('gay', 4),
 ('mistaken', 4),
 ('locked', 4),
 ('definite', 4),
 ('warlike', 4),
 ('smooth', 4),
 ('sheer', 4),
 ('south', 4),
 ('wicked', 4),
 ('touching', 4),
 ('manifest', 4),
 ('rocky', 4),
 ('safer', 4),
 ('laden', 4),
 ('harsh', 4),
 ('lethal', 4),
 ('useless', 4),
 ('doubtless', 4),
 ('idle', 4),
 ('nearest', 4),
 ('married', 4),
 ('haired', 4),
 ('ashamed', 4),
 ('perfect', 4),
 ('Little', 4),
 ('unselfish', 4),
 ('sceptical', 4),
 ('Sacred', 4),
 ('ye', 4),
 ('firm', 4),
 ('zoöphagous', 4),
 ('anæmic', 4),
 ('sore', 4),
 ('greatest', 4),
 ('entire', 4),
 ('needless', 4),
 ('Russian', 4),
 ('steady', 4),
 ('ignorant', 4),
 ('agonising', 4),
 ('rosy', 4),
 ('strict', 4),
 ('deserted', 4),
 ('padded', 4),
 ('troubled', 4),
 ('unexpected', 4),
 ('reasonable', 4),
 ('slight', 4),
 ('lethargic', 4),
 ('desolate', 4),
 ('deeper', 4),
 ('bewildered', 4),
 ('ole', 4),
 ('tender', 4),
 ('intent', 4),
 ('feeble', 4),
 ('whiter', 4),
 ('unable', 4),
 ('amazed', 4),
 ('harrowing', 4),
 ('closer', 4),
 ('rude', 4),
 ('leaden', 4),
 ('unclean', 4),
 ('hungry', 4),
 ('official', 4),
 ('brute', 4),
 ('highest', 4),
 ('worried', 4),
 ('hellish', 4),
 ('stable', 4),
 ('Turkish', 3),
 ('thirsty', 3),
 ('German', 3),
 ('elderly', 3),
 ('rich', 3),
 ('lofty', 3),
 ('glorious', 3),
 ('purple', 3),
 ('brown', 3),
 ('outer', 3),
 ('eastern', 3),
 ('oppressive', 3),
 ('sandy', 3),
 ('colder', 3),
 ('uncanny', 3),
 ('occasional', 3),
 ('remarkable', 3),
 ('Strange', 3),
 ('needful', 3),
 ('suitable', 3),
 ('solid', 3),
 ('Hungarian', 3),
 ('sleepless', 3),
 ('practical', 3),
 ('formal', 3),
 ('gruesome', 3),
 ('freer', 3),
 ('unlocked', 3),
 ('nineteenth', 3),
 ('Great', 3),
 ('dreamy', 3),
 ('delightful', 3),
 ('repulsive', 3),
 ('shadowy', 3),
 ('intact', 3),
 ('uncertain', 3),
 ('key', 3),
 ('square', 3),
 ('post', 3),
 ('earthy', 3),
 ('powerful', 3),
 ('happier', 3),
 ('redder', 3),
 ('bloated', 3),
 ('quickest', 3),
 ('inclined', 3),
 ('worthy', 3),
 ('balanced', 3),
 ('drunk', 3),
 ('loving', 3),
 ('fit', 3),
 ('nigh', 3),
 ('wakeful', 3),
 ('devouring', 3),
 ('manifold', 3),
 ('wet', 3),
 ('dank', 3),
 ('inner', 3),
 ('foolish', 3),
 ('eyed', 3),
 ('fond', 3),
 ('haggard', 3),
 ('languid', 3),
 ('respectful', 3),
 ('religious', 3),
 ('jealous', 3),
 ('flit', 3),
 ('medical', 3),
 ('advanced', 3),
 ('truest', 3),
 ('appalling', 3),
 ('dull', 3),
 ('sullen', 3),
 ('gravely:--', 3),
 ('thankful', 3),
 ('narcotic', 3),
 ('probable', 3),
 ('nauseous', 3),
 ('sovereign', 3),
 ('bold', 3),
 ('stupid', 3),
 ('frightful', 3),
 ('frantic', 3),
 ('frequent', 3),
 ('decent', 3),
 ('aware', 3),
 ('monstrous', 3),
 ('dreary', 3),
 ('bowed', 3),
 ('negative', 3),
 ('daily', 3),
 ('unholy', 3),
 ('smaller', 3),
 ('actual', 3),
 ('limited', 3),
 ('electric', 3),
 ('huge', 3),
 ('future', 3),
 ('powerless', 3),
 ('gallant', 3),
 ('final', 3),
 ('higher', 3),
 ('Unclean', 3),
 ('orderly', 3),
 ('plain', 3),
 ('wily', 3),
 ('derivative', 3),
 ('applicable', 3),
 ('defective', 3),
 ('western', 2),
 ('splendid', 2),
 ('extreme', 2),
 ('Carpathian', 2),
 ('continuous', 2),
 ('subject', 2),
 ('clumsy', 2),
 ('strangest', 2),
 ('stormy', 2),
 ('fashioned', 2),
 ('coloured', 2),
 ('reticent', 2),
 ('ridiculous', 2),
 ('imperative', 2),
 ('idolatrous', 2),
 ('mixed', 2),
 ('queer', 2),
 ('sympathetic', 2),
 ('pine', 2),
 ('varied', 2),
 ('kindly', 2),
 ('exciting', 2),
 ('slightest', 2),
 ('stranger', 2),
 ('prodigious', 2),
 ('minded', 2),
 ('momentary', 2),
 ('imperious', 2),
 ('moonlit', 2),
 ('bigger', 2),
 ('shaven', 2),
 ('Welcome', 2),
 ('octagonal', 2),
 ('welcome', 2),
 ('top', 2),
 ('arched', 2),
 ('looking', 2),
 ('absent', 2),
 ('opposite', 2),
 ('political', 2),
 ('aged', 2),
 ('friendly', 2),
 ('myriad', 2),
 ('closed', 2),
 ('malignant', 2),
 ('prosaic', 2),
 ('veritable', 2),
 ('fascinating', 2),
 ('fell', 2),
 ('unnatural', 2),
 ('sidelong', 2),
 ('thorough', 2),
 ('hateful', 2),
 ('merciful', 2),
 ('intolerable', 2),
 ('super', 2),
 ('languorous', 2),
 ('slender', 2),
 ('vile', 2),
 ('villainy', 2),
 ('piteous', 2),
 ('ruthless', 2),
 ('tiniest', 2),
 ('aërial', 2),
 ('phantom', 2),
 ('extravagant', 2),
 ('naked', 2),
 ('British', 2),
 ('stately', 2),
 ('out:--', 2),
 ('swollen', 2),
 ('merry', 2),
 ('strained', 2),
 ('cursed', 2),
 ('handsome', 2),
 ('curly', 2),
 ('fancy', 2),
 ('American', 2),
 ('momentous', 2),
 ('playful', 2),
 ('honest', 2),
 ('confused', 2),
 ('ungrateful', 2),
 ('determined', 2),
 ('valuable', 2),
 ('excitable', 2),
 ('secure', 2),
 ('happiest', 2),
 ('sweeter', 2),
 ('grand', 2),
 ('stean', 2),
 ('stubble', 2),
 ('fed', 2),
 ('kitten', 2),
 ('raw', 2),
 ('exceptional', 2),
 ('easier', 2),
 ('giant', 2),
 ('aud', 2),
 ('mild', 2),
 ('downward', 2),
 ('lively', 2),
 ('incredible', 2),
 ('flat', 2),
 ('chief', 2),
 ('impatient', 2),
 ('superstitious', 2),
 ('First', 2),
 ('gone', 2),
 ('furious', 2),
 ('paler', 2),
 ('larger', 2),
 ('routine', 2),
 ('servile', 2),
 ('sublime', 2),
 ('shifty', 2),
 ('bulky', 2),
 ('unhurt', 2),
 ('exhausted', 2),
 ('fat', 2),
 ('functional', 2),
 ('malady', 2),
 ('stiff', 2),
 ('seeming', 2),
 ('frenzied', 2),
 ('recuperative', 2),
 ('beneficial', 2),
 ('alarmed', 2),
 ('beneficent', 2),
 ('stalwart', 2),
 ('answer:--', 2),
 ('cerebral', 2),
 ('Quick', 2),
 ('undone', 2),
 ('pallid', 2),
 ('crimson', 2),
 ('medicinal', 2),
 ('healthy', 2),
 ('poignant', 2),
 ('hospitable', 2),
 ('ead', 2),
 ('ome', 2),
 ('unnecessary', 2),
 ('pronounced', 2),
 ('peaceful', 2),
 ('surgical', 2),
 ('acrid', 2),
 ('sternest', 2),
 ('outstretched', 2),
 ('profound', 2),
 ('so', 2),
 ('unattended', 2),
 ('Unopened', 2),
 ('genial', 2),
 ('spirited', 2),
 ('lips:--', 2),
 ('sterner', 2),
 ('professional', 2),
 ('specific', 2),
 ('mortem', 2),
 ('shocked', 2),
 ('eternal', 2),
 ('direct', 2),
 ('hostile', 2),
 ('youthful', 2),
 ('moral', 2),
 ('logical', 2),
 ('forceful', 2),
 ('harder', 2),
 ('following', 2),
 ('silly', 2),
 ('indicative', 2),
 ('typewritten', 2),
 ('barren', 2),
 ('overwrought', 2),
 ('corporeal', 2),
 ('comparative', 2),
 ('numerous', 2),
 ('northern', 2),
 ('fewer', 2),
 ('horrid', 2),
 ('unhallowed', 2),
 ('Most', 2),
 ('Un', 2),
 ('rational', 2),
 ('puzzled', 2),
 ('frank', 2),
 ('set', 2),
 ('affected', 2),
 ('obedient', 2),
 ('careless', 2),
 ('callous', 2),
 ('livid', 2),
 ('horrified', 2),
 ('carnal', 2),
 ('devilish', 2),
 ('blessed', 2),
 ('Brave', 2),
 ('hideous', 2),
 ('awkward', 2),
 ('chronological', 2),
 ('thoughtful', 2),
 ('ultimate', 2),
 ('central', 2),
 ('neutral', 2),
 ('emotional', 2),
 ('overwhelmed', 2),
 ('appealing', 2),
 ('contemptuous', 2),
 ('apt', 2),
 ('elemental', 2),
 ('positive', 2),
 ('live', 2),
 ('meaner', 2),
 ('idiotic', 2),
 ('relieved', 2),
 ('conventional', 2),
 ('liable', 2),
 ('longer', 2),
 ('respectable', 2),
 ('amenable', 2),
 ('odour', 2),
 ('corrupt', 2),
 ('sound', 2),
 ('false', 2),
 ('indifferent', 2),
 ('unspeakable', 2),
 ('typical', 2),
 ('laconic', 2),
 ('weakest', 2),
 ('paralysed', 2),
 ('pitiful', 2),
 ('simplest', 2),
 ('forgetful', 2),
 ('holiest', 2),
 ('devoted', 2),
 ('Hush', 2),
 ('greenish', 2),
 ('flagged', 2),
 ('radiant', 2),
 ('latest', 2),
 ('sole', 2),
 ('doubtful', 2),
 ('alert', 2),
 ('predestinate', 2),
 ('criminal', 2),
 ('commercial', 2),
 ('Many', 2),
 ('FULL', 2),
 ('online', 2),
 ('readable', 2),
 ('widest', 2),
 ('exempt', 2),
 ('federal', 2),
 ('*', 1),
 ('national', 1),
 ('wildest', 1),
 ('distinct', 1),
 ('eleventh', 1),
 ('imaginative', 1),
 ('mamaliga', 1),
 ('impletata', 1),
 ('unpunctual', 1),
 ('outside', 1),
 ('barbarian', 1),
 ('baggy', 1),
 ('enormous', 1),
 ('Oriental', 1),
 ('separate', 1),
 ('seventeenth', 1),
 ('ungracious', 1),
 ('disagreeable', 1),
 ('sloping', 1),
 ('gable', 1),
 ('bewildering', 1),
 ('rugged', 1),
 ('feverish', 1),
 ('serpentine', 1),
 ('prevalent', 1),
 ('peasant', 1),
 ('crazy', 1),
 ('rolling', 1),
 ('thunderous', 1),
 ('own:--', 1),
 ('universal', 1),
 ('rear', 1),
 ('manageable', 1),
 ('optical', 1),
 ('beetling', 1),
 ('shaggy', 1),
 ('impalpable', 1),
 ('interminable', 1),
 ('customary', 1),
 ('antique', 1),
 ('akin', 1),
 ('door:--', 1),
 ('hasty', 1),
 ('graceful', 1),
 ('discreet', 1),
 ('domed', 1),
 ('astonishing', 1),
 ('coarse', 1),
 ('squat', 1),
 ('rank', 1),
 ('protuberant', 1),
 ('costliest', 1),
 ('fabulous', 1),
 ('Blue', 1),
 ('crowded', 1),
 ('flattering', 1),
 ('smallest', 1),
 ('unchecked', 1),
 ('artificial', 1),
 ('triumphant', 1),
 ('undiscovered', 1),
 ('sided', 1),
 ('cardinal', 1),
 ('gloomy', 1),
 ('mediæval', 1),
 ('straggling', 1),
 ('habitable', 1),
 ('sparkling', 1),
 ('attuned', 1),
 ('saturnine', 1),
 ('conceivable', 1),
 ('preternatural', 1),
 ('remiss', 1),
 ('wretched', 1),
 ('bauble', 1),
 ('annoying', 1),
 ('magnificent', 1),
 ('menial', 1),
 ('tangible', 1),
 ('race:--', 1),
 ('European', 1),
 ('Ugric', 1),
 ('victorious', 1),
 ('unworthy', 1),
 ('dishonourable', 1),
 ('meagre', 1),
 ('thinnest', 1),
 ('inaccessible', 1),
 ('nocturnal', 1),
 ('impregnable', 1),
 ('bygone', 1),
 ('curtainless', 1),
 ('unchanged', 1),
 ('wavy', 1),
 ('musical', 1),
 ('deliberate', 1),
 ('thrilling', 1),
 ('scarlet', 1),
 ('Lower', 1),
 ('lurid', 1),
 ('soulless', 1),
 ('aghast', 1),
 ('unquestionable', 1),
 ('unwound', 1),
 ('sanctuary', 1),
 ('madness', 1),
 ('fearless', 1),
 ('spoken', 1),
 ('smoothest', 1),
 ('letters:--', 1),
 ('cheery', 1),
 ('surest', 1),
 ('sturdy', 1),
 ('studded', 1),
 ('unloaded', 1),
 ('spade', 1),
 ('nebulous', 1),
 ('Quicker', 1),
 ('quicker', 1),
 ('dishevelled', 1),
 ('metallic', 1),
 ('vaporous', 1),
 ('Austrian', 1),
 ('Greek', 1),
 ('circular', 1),
 ('sickly', 1),
 ('heavier', 1),
 ('stony', 1),
 ('genuine', 1),
 ('real:--', 1),
 ('Close', 1),
 ('ponderous', 1),
 ('louder', 1),
 ('angrier', 1),
 ('fuller', 1),
 ('filthy', 1),
 ('clanging', 1),
 ('fast', 1),
 ('assistant', 1),
 ('stenographic', 1),
 ('hurried', 1),
 ('imperturbable', 1),
 ('tough', 1),
 ('psychological', 1),
 ('exquisite', 1),
 ('manly', 1),
 ('sloppy', 1),
 ('rarer', 1),
 ('appetite', 1),
 ('sanguine', 1),
 ('centripetal', 1),
 ('centrifugal', 1),
 ('paramount', 1),
 ('noblest', 1),
 ('romantic', 1),
 ('nicest', 1),
 ('mournful', 1),
 ('gnarled', 1),
 ('brusquely:--', 1),
 ('eatin', 1),
 ('cheap', 1),
 ('dictatorial', 1),
 ('illsome', 1),
 ('ireful', 1),
 ('acant', 1),
 ('slippy', 1),
 ('poorish', 1),
 ('aftest', 1),
 ('bier', 1),
 ('Lively', 1),
 ('dearly', 1),
 ('pious', 1),
 ('tombstean', 1),
 ('paved', 1),
 ('back', 1),
 ('wholesome', 1),
 ('Whole', 1),
 ('rudimentary', 1),
 ('sleek', 1),
 ('unprepared', 1),
 ('tame', 1),
 ('undeveloped', 1),
 ('opiate', 1),
 ('cumulative', 1),
 ('hopeless', 1),
 ...]

Then we make a dataframe from this list:

Pandas Review

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

	adj	count
0	good	198
1	old	188
2	other	185
3	own	184
4	more	178
5	great	173
6	poor	171
7	little	164
8	dear	151
9	much	148
10	such	129
11	last	115
12	same	110
13	white	103
14	many	100
15	terrible	99
16	full	97
17	long	90
18	few	86
19	strange	85
20	first	78
21	new	73
22	ready	71
23	dead	69
24	red	67
25	whole	66
26	open	66
27	sweet	65
28	dark	60
29	strong	59
30	very	57
31	true	54
32	heavy	53
33	young	53
34	quick	48
35	able	47
36	happy	47
37	right	47
38	asleep	47
39	big	44
40	small	43
41	sure	43
42	better	43
43	best	41
44	cold	41
45	wild	41
46	close	41
47	free	41
48	late	40
49	certain	40
50	present	40
51	afraid	39
52	high	38
53	quiet	37
54	pale	36
55	silent	35
56	glad	35
57	usual	33
58	sad	33
59	possible	32
60	bad	32
61	least	31
62	beautiful	31
63	low	31
64	awful	31
65	thin	31
66	hard	30
67	brave	30
68	alone	29
69	mad	29
70	next	28
71	deep	28
72	anxious	28
73	wonderful	27
74	empty	27
75	electronic	27
76	black	26
77	sharp	26
78	half	26
79	awake	25
80	sudden	25
81	horrible	25
82	necessary	25
83	fair	25
84	safe	25
85	Good	25
86	grim	24
87	bright	24
88	fresh	24
89	tired	24
90	wide	23
91	different	23
92	only	23
93	common	22
94	satisfied	22
95	noble	21
96	short	21
97	enough	21
98	dreadful	21
99	bitter	21
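Notice that “good” and “Good” show up as separate rows in the table above. That happens because Counter treats differently capitalized strings as different items. If we want to merge them, one option is to lowercase each word before counting. The sketch below uses a small made-up list (adjs_sample is a hypothetical stand-in, not the real Dracula data) so that it runs on its own:

```python
from collections import Counter

# Hypothetical sample standing in for the full list of adjectives
adjs_sample = ["good", "Good", "old", "good", "Quick", "quick"]

# Lowercase every word before counting so that "Good" and "good" are tallied together
adjs_tally_lower = Counter(word.lower() for word in adjs_sample)

print(adjs_tally_lower.most_common())
```

Whether to merge is a judgment call. Lowercasing would also turn adjectives like “British” and “American” into “british” and “american,” which may or may not be what we want.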

Get Nouns#

POS	Description	Examples
NOUN	noun	girl, cat, tree, air, beauty

To extract and count nouns, we can follow the same model as above, except we will change our if statement to check for POS labels that match “NOUN”.

nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

	noun	count
0	time	385
1	night	314
2	man	251
3	room	231
4	way	223
5	day	218
6	hand	202
7	door	199
8	face	197
9	eyes	188
10	things	171
11	friend	166
12	work	162
13	life	144
14	heart	140
15	men	138
16	place	133
17	house	133
18	window	116
19	sleep	112
20	blood	112
21	one	111
22	moment	106
23	head	104
24	hands	104
25	morning	98
26	thing	91
27	_	89
28	bed	89
29	death	88
30	mind	87
31	others	82
32	sort	81
33	child	74
34	fear	72
35	case	72
36	husband	72
37	rest	71
38	side	68
39	light	68
40	word	66
41	soul	65
42	world	62
43	part	61
44	days	61
45	box	61
46	ship	61
47	dear	60
48	water	59
49	end	59
50	lips	59
51	woman	57
52	look	57
53	hour	56
54	diary	56
55	horses	56
56	brain	55
57	body	55
58	sun	54
59	air	54
60	times	54
61	voice	52
62	fellow	51
63	words	50
64	earth	50
65	boxes	50
66	trouble	49
67	thought	49
68	mother	48
69	people	47
70	morrow	47
71	silence	47
72	letter	46
73	strength	46
74	cause	46
75	feet	46
76	power	46
77	kind	45
78	home	45
79	women	45
80	wolves	45
81	sunset	44
82	sea	43
83	key	43
84	o'clock	43
85	throat	43
86	patient	43
87	snow	42
88	teeth	42
89	knowledge	42
90	instant	41
91	friends	41
92	matter	41
93	duty	40
94	fire	40
95	none	40
96	coffin	40
97	sight	39
98	minutes	39
99	wind	39

Get Verbs#

POS	Description	Examples
VERB	verb	run, runs, running, eat, ate, eating

To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we’re going to make our code even more economical and efficient (while still changing our if statement to match the POS label “VERB”).

Python Review

We can use a list comprehension to get our list of verbs in a single line of code! Closely examine the first line of code below:

verbs = [token.text for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

	verb	count
0	said	461
1	know	396
2	see	377
3	have	348
4	came	303
5	went	298
6	come	295
7	do	278
8	had	277
9	go	269
10	seemed	240
11	took	223
12	saw	216
13	think	216
14	made	196
15	looked	186
16	was	183
17	tell	175
18	get	168
19	make	163
20	got	156
21	found	154
22	is	153
23	told	144
24	say	141
25	asked	139
26	take	136
27	knew	130
28	done	128
29	find	114
30	let	113
31	want	112
32	began	109
33	put	106
34	thought	105
35	hear	101
36	coming	98
37	seen	95
38	look	94
39	keep	94
40	heard	91
41	looking	89
42	felt	86
43	turned	84
44	left	83
45	stood	80
46	opened	80
47	read	79
48	help	79
49	give	78
50	sleep	78
51	feel	77
52	held	73
53	seems	72
54	are	72
55	lay	72
56	gone	70
57	sat	69
58	ask	68
59	gave	67
60	going	65
61	believe	65
62	seem	64
63	spoke	64
64	try	64
65	has	64
66	set	63
67	fear	63
68	speak	62
69	tried	62
70	write	61
71	did	60
72	fell	57
73	kept	56
74	understand	55
75	passed	55
76	leave	55
77	be	55
78	suppose	53
79	ran	50
80	answered	50
81	grew	49
82	like	48
83	love	48
84	taken	47
85	used	47
86	were	46
87	lost	45
88	called	44
89	die	44
90	says	44
91	stopped	43
92	wanted	43
93	moved	43
94	wish	43
95	wait	42
96	mean	42
97	meet	42
98	given	42
99	laid	42
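Because the adjective, noun, and verb examples all follow the same pattern, we could wrap that pattern in one small helper function. The sketch below is one possible way to do it, not the only one. The function name tally_by_pos and the toy tokens are hypothetical. The toy tokens imitate the .text, .lemma_, and .pos_ attributes of spaCy tokens so that the example runs without loading a model:

```python
from collections import Counter
from types import SimpleNamespace

def tally_by_pos(tokens, pos_label, attribute="text"):
    """Count the chosen attribute of every token whose POS label matches pos_label."""
    return Counter(getattr(token, attribute) for token in tokens if token.pos_ == pos_label)

# Hypothetical stand-in tokens (a real spaCy document would supply these attributes itself)
toy_tokens = [
    SimpleNamespace(text="came", lemma_="come", pos_="VERB"),
    SimpleNamespace(text="come", lemma_="come", pos_="VERB"),
    SimpleNamespace(text="night", lemma_="night", pos_="NOUN"),
    SimpleNamespace(text="went", lemma_="go", pos_="VERB"),
]

# Count verbs exactly as they appear in the text
print(tally_by_pos(toy_tokens, "VERB").most_common())

# Count verbs by their base form, so "came" and "come" are tallied together
print(tally_by_pos(toy_tokens, "VERB", attribute="lemma_").most_common())
```

With the real document, a call such as tally_by_pos(document, "VERB", attribute="lemma_") should group forms like “came,” “come,” and “coming,” which are listed separately in the table above.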

Keyword Extraction#

Get Sentences with Keyword#

spaCy can also identify sentences in a document. To access sentences, we can iterate through document.sents and pull out the .text of each sentence.

We can use spaCy’s sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below.

With the function find_sentences_with_keyword(), we will iterate through document.sents and pull out any sentence that contains a particular “keyword.” Then we will display these sentences with the keywords bolded.

import re
from IPython.display import Markdown, display

def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(re.escape(keyword), f"**{keyword}**", sentence, flags=re.IGNORECASE)
            
            display(Markdown(sentence))

find_sentences_with_keyword(keyword="telegram", document=document)

“ _telegram from Arthur Holmwood to Quincey P. Morris.

“ _telegram, Arthur Holmwood to Seward.

You must send to me the telegram every day; and if there be cause I shall come again.

telegram, Seward, London, to Van Helsing, Amsterdam._

“ telegram, Seward, London, to Van Helsing, Amsterdam.

“ telegram, Seward, London, to Van Helsing, Amsterdam. “6 September.–Terrible change for the worse.

I hold over telegram to Holmwood till have seen you.

“I waited till I had seen you, as I said in my telegram.

A telegram came from Van Helsing at Amsterdam whilst I was at dinner, suggesting that I should be at Hillingham to-night, as it might be well to be at hand, and stating that he was leaving by the night mail and would join me early in the morning.

telegram, Van Helsing, Antwerp, to Seward, Carfax._ (Sent to Carfax, Sussex, as no county given; delivered late by twenty-two hours.)

The arrival of Van Helsing’s telegram filled me with dismay.

Did you not get my telegram?” I answered as quickly and coherently as I could that I had only got his telegram early in the morning, and had not lost a minute in coming here, and that I could not make any one in the house hear me.

“ He handed me a telegram:– “Have not heard from Seward for three days, and am terribly anxious. Cannot leave.

“ In the hall I met Quincey Morris, with a telegram for Arthur telling him that Mrs. Westenra was dead; that Lucy also had been ill, but was now going on better; and that Van Helsing and I were with her.

Later.–A sad home-coming in every way–the house empty of the dear soul who was so good to us; Jonathan still pale and dizzy under a slight relapse of his malady; and now a telegram from Van Helsing, whoever he may be:– “You will be grieved to hear that Mrs. Westenra died five days ago, and that Lucy died the day before yesterday.

“ _telegram, Mrs. Harker to Van Helsing.

When we arrived at the Berkeley Hotel, Van Helsing found a telegram waiting for him:– “Am coming up by train.

I have sent a telegram to Jonathan to come on here when he arrives in London from Whitby.

“ About half an hour after we had received Mrs. Harker’s telegram, there came a quiet, resolute knock at the hall door.

“Nota bene, in Madam’s telegram he went south from Carfax, that means he went to cross the river, and he could only do so at slack of tide, which should be something before one o’clock.

Lord Godalming went to the Consulate to see if any telegram had arrived for him, whilst the rest of us came on to this hotel–“the Odessus.”

He had four telegrams, one each day since we started, and all to the same effect: that the Czarina Catherine had not been reported to Lloyd’s from anywhere.

He had arranged before leaving London that his agent should send him every day a telegram saying if the ship had been reported.

Daily telegrams to Godalming, but only the same story: “Not yet reported.”

telegram, October 24th.

We were all wild with excitement yesterday when Godalming got his telegram from Lloyd’s.

The telegrams from London have been the same: “no further report.”

28 October.–telegram.

“ Dr. Seward’s Diary. 28 October.–When the telegram came announcing the arrival in Galatz I do not think it was such a shock to any of us as might have been expected.
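Two details in the output above are worth noticing. The keyword always appears in lowercase, even at the start of a sentence, because the function re-inserts the keyword exactly as we typed it. The search also matches inside longer words, which is why sentences with “telegrams” are included. If we wanted to keep the original capitalization, and optionally match whole words only, the bolding step could be sketched like this. The helper name bold_keyword and its whole_word parameter are hypothetical:

```python
import re

def bold_keyword(sentence, keyword, whole_word=True):
    """Wrap each case-insensitive match of keyword in ** while keeping its original capitalization."""
    # re.escape() keeps characters such as "." or "?" in the keyword from being read as regex syntax
    pattern = re.escape(keyword)
    if whole_word:
        # \b marks a word boundary, so "telegram" will not match inside "telegrams"
        pattern = rf"\b{pattern}\b"
    # \g<0> re-inserts the matched text itself, so "Telegram" stays capitalized
    return re.sub(pattern, r"**\g<0>**", sentence, flags=re.IGNORECASE)

print(bold_keyword("Telegram from Arthur Holmwood to Quincey P. Morris.", "telegram"))
print(bold_keyword("Daily telegrams to Godalming.", "telegram", whole_word=False))
```

Whole-word matching is a trade-off. It avoids accidental matches inside unrelated words, but it also skips plurals, so whole_word=False may be the better choice when plural forms matter.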

Get Keyword in Context#

We can also find out about a keyword’s more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what are called ngrams. An “ngram” is any sequence of n tokens in a text, and ngrams are an important concept in computational linguistics and NLP. (Have you ever played with Google’s Ngram Viewer?)

Below we’re going to make a list of bigrams, that is, all the two-word combinations from Dracula. We’re going to use these bigrams to find the neighboring words that appear alongside particular keywords.

#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

bigrams = get_bigrams(tokens_and_labels)

Let’s take a peek at the bigrams:

bigrams[5:20]

[[('by', 'ADP'), ('Bram', 'PROPN')],
 [('Bram', 'PROPN'), ('Stoker', 'PROPN')],
 [('Stoker', 'PROPN'), ('This', 'DET')],
 [('This', 'DET'), ('eBook', 'PROPN')],
 [('eBook', 'PROPN'), ('is', 'AUX')],
 [('is', 'AUX'), ('for', 'ADP')],
 [('for', 'ADP'), ('the', 'DET')],
 [('the', 'DET'), ('use', 'NOUN')],
 [('use', 'NOUN'), ('of', 'ADP')],
 [('of', 'ADP'), ('anyone', 'PRON')],
 [('anyone', 'PRON'), ('anywhere', 'ADV')],
 [('anywhere', 'ADV'), ('at', 'ADP')],
 [('at', 'ADP'), ('no', 'DET')],
 [('no', 'DET'), ('cost', 'NOUN')],
 [('cost', 'NOUN'), ('and', 'CCONJ')]]
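The same sliding-window idea can be written more compactly with Python’s built-in zip() function. The sketch below is an alternative, not a replacement for get_bigrams(). The name get_ngrams and the toy word list are hypothetical:

```python
def get_ngrams(items, n=2):
    """Return every run of n consecutive items as a tuple."""
    # items[0:], items[1:], ... are copies of the list shifted by one position each;
    # zip() lines them up and stops when the shortest copy runs out
    return list(zip(*(items[i:] for i in range(n))))

toy_words = ["the", "use", "of", "anyone"]

print(get_ngrams(toy_words))       # two-word combinations
print(get_ngrams(toy_words, n=3))  # three-word combinations
```

One difference to keep in mind: this version returns each ngram as a tuple rather than a list. Code that only loops over the items in each ngram should behave the same either way.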

Now that we have our list of bigrams, we’re going to make a function get_neighbor_words(). This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the pos_label parameter.

def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if pos_label is None or label == pos_label:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

get_neighbor_words("telegram", bigrams)

[('a', 6),
 ('from', 3),
 ('seward', 3),
 ('arthur', 2),
 ('the', 2),
 ('to', 2),
 ('my', 2),
 ('i', 2),
 ('came', 2),
 ('helsing', 2),
 ('his', 2),
 ('harker', 2),
 ('morris', 1),
 ('every', 1),
 ('see', 1),
 ('day', 1),
 ('back', 1),
 ('over', 1),
 ('it', 1),
 ('van', 1),
 ('filled', 1),
 ('early', 1),
 ('for', 1),
 ('waiting', 1),
 ('there', 1),
 ('madam', 1),
 ('he', 1),
 ('any', 1),
 ('had', 1),
 ('saying', 1),
 ('masts', 1),
 ('october', 1)]

get_neighbor_words("telegram", bigrams, pos_label='VERB')

[('came', 2), ('see', 1), ('filled', 1), ('waiting', 1), ('saying', 1)]
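The function get_neighbor_words() pools together the words that come before and after the keyword. If we wanted to tell the two sides apart, we could count them separately. The sketch below is one way to do that, assuming bigrams shaped like the ones above (pairs of (word, label) tuples). The function name get_left_right_neighbors and the toy bigrams are hypothetical:

```python
from collections import Counter

def get_left_right_neighbors(keyword, bigrams):
    """Count words directly before (left) and directly after (right) the keyword."""
    keyword = keyword.lower()
    left, right = Counter(), Counter()

    for (first_word, _), (second_word, _) in bigrams:
        first_word, second_word = first_word.lower(), second_word.lower()

        # If the keyword is the second word, the first word is its left neighbor
        if second_word == keyword:
            left[first_word] += 1
        # If the keyword is the first word, the second word is its right neighbor
        if first_word == keyword:
            right[second_word] += 1

    return left, right

# Hypothetical toy bigrams in the same format as the real list
toy_bigrams = [
    [("a", "DET"), ("telegram", "NOUN")],
    [("telegram", "NOUN"), ("came", "VERB")],
    [("came", "VERB"), ("from", "ADP")],
    [("my", "PRON"), ("telegram", "NOUN")],
]

left, right = get_left_right_neighbors("telegram", toy_bigrams)
print(left.most_common())
print(right.most_common())
```

Separating the two sides can make patterns easier to read. For example, determiners like “a” and “my” tend to fall on the left of a noun, while verbs like “came” often fall on the right.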

Your Turn!#

Try out find_sentences_with_keyword() and get_neighbor_words() with your own keywords of interest.

find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)

get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)