Named Entity Recognition#

In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for NER with other languages:

  • Please reach out if you’re interested in adding another language!


Dataset#

Ada Lovelace’s Obituary & Louisa May Alcott’s Little Women#

A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843.

-Claire Cain Miller, “Ada Lovelace,” New York Times Overlooked Obituaries

Here’s a preview of spaCy’s NER tagging Ada Lovelace’s obituary:


A gifted mathematician who is now recognized as the first ORDINAL computer programmer.By CLAIRE CAIN MILLER PERSON

A century DATE before the dawn of the computer age, Ada Lovelace PERSON imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard PERSON loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage’s PERSON Analytical Engine PERSON , was never built. But her writings about computing have earned Lovelace PERSON — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.

The program she wrote for the Analytical Engine was to calculate the seventh ORDINAL Bernoulli PERSON number. ( Bernoulli PERSON numbers, named after the Swiss NORP mathematician Jacob Bernoulli PERSON , are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculating numbers, she said, to understand symbols and be used to create music or art.

“This insight would become the core concept of the digital age,” Walter Isaacson PERSON wrote in his book “The Innovators WORK_OF_ART .” “Any piece of content, data or information — music, text, pictures, numbers, symbols, sounds, video — could be expressed in digital form and manipulated by machines.” She also explored the ramifications of what a computer could do, writing about the responsibility placed on the person programming the machine, and raising and then dismissing the notion that computers could someday think and create on their own — what we now call artificial intelligence.

The Analytical Engine WORK_OF_ART has no pretensions whatever to originate any thing,” she wrote. “It can do whatever we know how to order it to perform.”

Lovelace PERSON , a British NORP socialite who was the daughter of Lord Byron ORG , the Romantic poet, had a gift for combining art and science, one of her biographers, Betty Alexandra Toole PERSON , has written. She thought of math and logic as creative and imaginative, and called it “poetical science.”

Math PERSON “constitutes the language through which alone we can adequately express the great facts of the natural world,” Lovelace PERSON wrote.

Her work, which was rediscovered in the mid-20th century DATE , inspired the Defense Department ORG to name a programming language after her and each October DATE Ada Lovelace PERSON Day signifies a celebration of women in technology. Lovelace PERSON lived when women were not considered to be prominent scientific thinkers, and her skills were often described as masculine.

“With an understanding thoroughly masculine in solidity, grasp and firmness, Lady Lovelace PERSON had all the delicacies of the most refined female character,” said an obituary in The London Examiner FAC .

Babbage ORG , who called her the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

Augusta Ada Byron ORG was born on Dec. 10, 1815 DATE , in London GPE , to Lord Byron ORG and Annabella Milbanke ORG . Her parents separated when she was an infant, and her father died when she was 8 DATE . Her mother — whom Lord Byron called the “princess of parallelograms” and, after their falling out, a “mathematical Medea” — was a social reformer from a wealthy family who had a deep interest in mathematics.

An etching from a portrait of Lovelace PERSON as a child. She is said to have had a gift for combining art and science. Smith Collection/Gado/Getty Images Lovelace ORG showed a passion for math and mechanics from a young age, encouraged by her mother. Because of her class, she had access to private tutors and to intellectuals in British NORP scientific and literary society. She was insatiably curious and surrounded herself with big thinkers of the day DATE , including Mary Somerville PERSON , a scientist and writer.

It was Somerville GPE who introduced Lovelace PERSON to Babbage when she was 17 DATE , at a salon he hosted soon after she made her society debut. He showed her a two-foot QUANTITY high, brass mechanical calculator he had built, and it gripped her imagination. They began a correspondence about math and science that lasted almost two decades DATE .

She also met her husband, William King PERSON , through Somerville GPE . They married in 1835 DATE , when she was 19 DATE . He soon became an earl, and she became the Countess of Lovelace PERSON . By 1839 DATE , she had given birth to two CARDINAL sons and a daughter.

She was determined, however, not to let her family life slow her work. The year she was married, she wrote to Somerville GPE : “I now read Mathematics NORP every day and am occupied in Trigonometry PRODUCT and in preliminaries to Cubic GPE and Biquadratic Equations. So you see that matrimony has by no means lessened my taste for these pursuits, nor my determination to carry them on.”

In 1840 DATE , Lovelace PERSON asked Augustus De Morgan PERSON , a math professor in London GPE , to tutor her. Through exchanging letters, he taught her university-level math. He later wrote to her mother that if a young male student had shown her skill, “they would have certainly made him an original mathematical investigator, perhaps of first ORDINAL -rate eminence.”

It was in 1843 DATE , when she was 27 DATE , that Lovelace PERSON wrote her most lasting contribution to computer science.

She published her translation of an academic paper about the Babbage Analytical Engine ORG and added a section, nearly three CARDINAL times the length of the paper, titled, “ Notes PRODUCT .” Here, she described how the computer would work, imagined its potential and wrote the first ORDINAL program.

Researchers have come to see it as “an extraordinary document,” said Ursula Martin PERSON , a computer scientist at the University of Oxford ORG who has studied Lovelace PERSON ’s life and work. “She’s talking about the abstract principles of computation, how you could program it, and big ideas like maybe it could compose music, maybe it could think.”

Lovelace PERSON died less than a decade later DATE , on Nov. 27, 1852 DATE . In the “ Notes PRODUCT ,” she imagined a future in which computers could do more powerful and faster analysis than humans.

“A new, a vast and a powerful language is developed for the future use of analysis,” she wrote, “in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind.”

Claire Cain Miller PERSON writes about gender for The Upshot ORG . She first ORDINAL learned about Ada Lovelace PERSON while covering the tech industry, where women are severely underrepresented.

Why is NER Useful?#

Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we’ll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we’ll also do in a later lesson!).

Natural Language Processing (NLP)#

Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is to understand human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google translate, chat bots, or Siri? These are all examples of NLP in action!

Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts and images are now getting eerily good.

Open-source NLP tools are getting very good, too. We’re going to use one of these open-source tools, the Python library spaCy, for our Named Entity Recognition tasks in this lesson.

How spaCy Works#

The screenshot above shows spaCy correctly identifying named entities in Ada Lovelace’s New York Times obituary (something that we’ll test out for ourselves below). How does spaCy know that “Ada Lovelace” is a person and that “1843” is a date?

Well, spaCy doesn’t know, not for sure anyway. Instead, spaCy is making a very educated guess. This “guess” is based on what spaCy has learned about the English language after seeing lots of other examples.

That’s a colloquial way of saying: spaCy relies on machine learning models that were trained on a large number of carefully labeled texts. These texts were, in fact, often labeled and corrected by hand. This is similar to our topic modeling work from the previous lesson, except our topic model wasn’t using labeled data.

The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.

When spaCy identifies people and places in Ada Lovelace’s obituary, in other words, the NLP model is actually making predictions about the text based on what it has learned about how people and places function in English-language sentences.

NER with spaCy#

Install spaCy#

!pip install -U spacy

Import Libraries#

We’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We’re also going to import the Counter class (from Python’s built-in collections module) for counting people, places, and things, and the pandas library for organizing and displaying data (we’re also changing the pandas default display settings for the maximum number of rows and the maximum column width).
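
As a quick illustration of what Counter does, here is a minimal sketch that tallies a small list of names. The list is typed by hand for demonstration and is not drawn from our texts, and the variable names are only illustrative.

```python
from collections import Counter

# A hypothetical list of names, typed by hand for illustration only
names = ["Ada", "Babbage", "Ada", "Somerville", "Ada", "Babbage"]

# Counter tallies how many times each item appears in the list
name_tally = Counter(names)

# most_common(2) returns the two most frequent (item, count) pairs
top_names = name_tally.most_common(2)
print(top_names)
```

We will use this same pattern later in the lesson, when we count the named entities that spaCy finds.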

Download Language Model#

Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

!python -m spacy download en_core_web_sm

Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.

spaCy offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.

Load Language Model#

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

import en_core_web_sm
nlp = en_core_web_sm.load()

2. We can load the model by name.

#nlp = spacy.load('en_core_web_sm')

If you just downloaded the model for the first time, it’s advisable to use Option 1. Then you can use the model immediately. Otherwise, you’ll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).

Process Document#

We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read Ada Lovelace’s obituary. Then we run nlp() on the text and create our document.

filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

spaCy Named Entities#

Below is a Named Entities chart for English-language spaCy taken from its website. This chart shows the different named entities that spaCy can identify as well as their corresponding type labels.

Type Label     Description
PERSON         People, including fictional.
NORP           Nationalities or religious or political groups.
FAC            Buildings, airports, highways, bridges, etc.
ORG            Companies, agencies, institutions, etc.
GPE            Countries, cities, states.
LOC            Non-GPE locations, mountain ranges, bodies of water.
PRODUCT        Objects, vehicles, foods, etc. (Not services.)
EVENT          Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART    Titles of books, songs, etc.
LAW            Named documents made into laws.
LANGUAGE       Any named language.
DATE           Absolute or relative dates or periods.
TIME           Times smaller than a day.
PERCENT        Percentage, including “%”.
MONEY          Monetary values, including unit.
QUANTITY       Measurements, as of weight or distance.
ORDINAL        “first”, “second”, etc.
CARDINAL       Numerals that do not fall under another type.

To quickly see spaCy’s English-language NER in action, we can use the spaCy module displacy with the style= parameter set to “ent” (short for entities):

displacy.render(document, style="ent")
A gifted mathematician who is now recognized as the first ORDINAL computer programmer.By CLAIRE CAIN MILLER PERSON

A century DATE before the dawn of the computer age, Ada Lovelace PERSON imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard PERSON loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage’s PERSON Analytical Engine PERSON , was never built. But her writings about computing have earned Lovelace PERSON — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.

The program she wrote for the Analytical Engine was to calculate the seventh ORDINAL Bernoulli PERSON number. ( Bernoulli PERSON numbers, named after the Swiss NORP mathematician Jacob Bernoulli PERSON , are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculating numbers, she said, to understand symbols and be used to create music or art.

“This insight would become the core concept of the digital age,” Walter Isaacson PERSON wrote in his book “The Innovators WORK_OF_ART .” “Any piece of content, data or information — music, text, pictures, numbers, symbols, sounds, video — could be expressed in digital form and manipulated by machines.” She also explored the ramifications of what a computer could do, writing about the responsibility placed on the person programming the machine, and raising and then dismissing the notion that computers could someday think and create on their own — what we now call artificial intelligence.

The Analytical Engine WORK_OF_ART has no pretensions whatever to originate any thing,” she wrote. “It can do whatever we know how to order it to perform.”

Lovelace PERSON , a British NORP socialite who was the daughter of Lord Byron ORG , the Romantic poet, had a gift for combining art and science, one of her biographers, Betty Alexandra Toole PERSON , has written. She thought of math and logic as creative and imaginative, and called it “poetical science.”

Math PERSON “constitutes the language through which alone we can adequately express the great facts of the natural world,” Lovelace PERSON wrote.

Her work, which was rediscovered in the mid-20th century DATE , inspired the Defense Department ORG to name a programming language after her and each October DATE Ada Lovelace PERSON Day signifies a celebration of women in technology. Lovelace PERSON lived when women were not considered to be prominent scientific thinkers, and her skills were often described as masculine.

“With an understanding thoroughly masculine in solidity, grasp and firmness, Lady Lovelace PERSON had all the delicacies of the most refined female character,” said an obituary in The London Examiner FAC .

Babbage ORG , who called her the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

Augusta Ada Byron ORG was born on Dec. 10, 1815 DATE , in London GPE , to Lord Byron ORG and Annabella Milbanke ORG . Her parents separated when she was an infant, and her father died when she was 8 DATE . Her mother — whom Lord Byron called the “princess of parallelograms” and, after their falling out, a “mathematical Medea” — was a social reformer from a wealthy family who had a deep interest in mathematics.

An etching from a portrait of Lovelace PERSON as a child. She is said to have had a gift for combining art and science. Smith Collection/Gado/Getty Images Lovelace ORG showed a passion for math and mechanics from a young age, encouraged by her mother. Because of her class, she had access to private tutors and to intellectuals in British NORP scientific and literary society. She was insatiably curious and surrounded herself with big thinkers of the day DATE , including Mary Somerville PERSON , a scientist and writer.

It was Somerville GPE who introduced Lovelace PERSON to Babbage when she was 17 DATE , at a salon he hosted soon after she made her society debut. He showed her a two-foot QUANTITY high, brass mechanical calculator he had built, and it gripped her imagination. They began a correspondence about math and science that lasted almost two decades DATE .

She also met her husband, William King PERSON , through Somerville GPE . They married in 1835 DATE , when she was 19 DATE . He soon became an earl, and she became the Countess of Lovelace PERSON . By 1839 DATE , she had given birth to two CARDINAL sons and a daughter.

She was determined, however, not to let her family life slow her work. The year she was married, she wrote to Somerville GPE : “I now read Mathematics NORP every day and am occupied in Trigonometry PRODUCT and in preliminaries to Cubic GPE and Biquadratic Equations. So you see that matrimony has by no means lessened my taste for these pursuits, nor my determination to carry them on.”

In 1840 DATE , Lovelace PERSON asked Augustus De Morgan PERSON , a math professor in London GPE , to tutor her. Through exchanging letters, he taught her university-level math. He later wrote to her mother that if a young male student had shown her skill, “they would have certainly made him an original mathematical investigator, perhaps of first ORDINAL -rate eminence.”

It was in 1843 DATE , when she was 27 DATE , that Lovelace PERSON wrote her most lasting contribution to computer science.

She published her translation of an academic paper about the Babbage Analytical Engine ORG and added a section, nearly three CARDINAL times the length of the paper, titled, “ Notes PRODUCT .” Here, she described how the computer would work, imagined its potential and wrote the first ORDINAL program.

Researchers have come to see it as “an extraordinary document,” said Ursula Martin PERSON , a computer scientist at the University of Oxford ORG who has studied Lovelace PERSON ’s life and work. “She’s talking about the abstract principles of computation, how you could program it, and big ideas like maybe it could compose music, maybe it could think.”

Lovelace PERSON died less than a decade later DATE , on Nov. 27, 1852 DATE . In the “ Notes PRODUCT ,” she imagined a future in which computers could do more powerful and faster analysis than humans.

“A new, a vast and a powerful language is developed for the future use of analysis,” she wrote, “in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind.”

Claire Cain Miller PERSON writes about gender for The Upshot ORG . She first ORDINAL learned about Ada Lovelace PERSON while covering the tech industry, where women are severely underrepresented.

From a quick glance at the text above, we can see that the English-language spaCy model is doing quite well with NER. But it’s definitely not perfect.

Though spaCy correctly identifies “Ada Lovelace” as a PERSON in the first sentence, a few paragraphs later it labels “Augusta Ada Byron” as an ORG, and it tags “Analytical Engine” as a PERSON in one sentence and as a WORK_OF_ART in another. Though spaCy correctly identifies “London” as a place (GPE), it incorrectly identifies “Somerville” as a place (GPE), too, when really Mary Somerville is a person. It also labels “Jacquard” as a PERSON, when the word here describes a type of loom (named after Joseph Marie Jacquard).

This inconsistency is very important to note and keep in mind. If we wanted to use spaCy’s English-language NER model for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn’t be perfect. That’s why understanding the limitations of this tool is so crucial. While spaCy’s English-language NER can be very good for identifying entities in broad strokes, it can’t be relied upon for anything exact and fine-grained — not out of the box anyway.

Get Named Entities#

All the named entities in our document can be found in the document.ents property. If we check out document.ents, we can see all the entities from Ada Lovelace’s obituary.

document.ents
(first,
 CLAIRE CAIN MILLER,
 A century,
 Ada Lovelace,
 1843,
 Jacquard,
 British,
 Charles Babbage’s,
 Analytical Engine,
 Lovelace,
 1852,
 36,
 first,
 seventh,
 Bernoulli,
 Bernoulli,
 Swiss,
 Jacob Bernoulli,
 Walter Isaacson,
 “The Innovators,
 The Analytical Engine,
 Lovelace,
 British,
 Lord Byron,
 Betty Alexandra Toole,
 Math,
 Lovelace,
 the mid-20th century,
 the Defense Department,
 October,
 Ada Lovelace,
 Lovelace,
 Lady Lovelace,
 The London Examiner,
 Babbage,
 Augusta Ada Byron,
 Dec. 10, 1815,
 London,
 Lord Byron,
 Annabella Milbanke,
 8,
 Lovelace,
 Smith Collection/Gado/Getty Images
 
  Lovelace,
 British,
 the day,
 Mary Somerville,
 Somerville,
 Lovelace,
 17,
 two-foot,
 almost two decades,
 William King,
 Somerville,
 1835,
 19,
 Lovelace,
 1839,
 two,
 Somerville,
 Mathematics,
 Trigonometry,
 Cubic,
 1840,
 Lovelace,
 Augustus De Morgan,
 London,
 first,
 1843,
 27,
 Lovelace,
 the Babbage Analytical Engine,
 nearly three,
 Notes,
 first,
 Ursula Martin,
 the University of Oxford,
 Lovelace,
 Lovelace,
 less than a decade later,
 Nov. 27, 1852,
 Notes,
 Claire Cain Miller,
 The Upshot,
 first,
 Ada Lovelace)

Each of the named entities in document.ents contains more information about itself, which we can access by iterating through the document.ents with a simple for loop.

For each named_entity in document.ents, we will extract the named_entity and its corresponding named_entity.label_.

for named_entity in document.ents:
    print(named_entity, named_entity.label_)
first ORDINAL
CLAIRE CAIN MILLER PERSON
A century DATE
Ada Lovelace PERSON
1843 DATE
Jacquard PERSON
British NORP
Charles Babbage’s PERSON
Analytical Engine PERSON
Lovelace PERSON
1852 DATE
36 CARDINAL
first ORDINAL
seventh ORDINAL
Bernoulli PERSON
Bernoulli PERSON
Swiss NORP
Jacob Bernoulli PERSON
Walter Isaacson PERSON
“The Innovators WORK_OF_ART
The Analytical Engine WORK_OF_ART
Lovelace PERSON
British NORP
Lord Byron ORG
Betty Alexandra Toole PERSON
Math PERSON
Lovelace PERSON
the mid-20th century DATE
the Defense Department ORG
October DATE
Ada Lovelace PERSON
Lovelace PERSON
Lady Lovelace PERSON
The London Examiner FAC
Babbage ORG
Augusta Ada Byron ORG
Dec. 10, 1815 DATE
London GPE
Lord Byron ORG
Annabella Milbanke ORG
8 DATE
Lovelace PERSON
Smith Collection/Gado/Getty Images

 Lovelace ORG
British NORP
the day DATE
Mary Somerville PERSON
Somerville GPE
Lovelace PERSON
17 DATE
two-foot QUANTITY
almost two decades DATE
William King PERSON
Somerville GPE
1835 DATE
19 DATE
Lovelace PERSON
1839 DATE
two CARDINAL
Somerville GPE
Mathematics NORP
Trigonometry PRODUCT
Cubic GPE
1840 DATE
Lovelace PERSON
Augustus De Morgan PERSON
London GPE
first ORDINAL
1843 DATE
27 DATE
Lovelace PERSON
the Babbage Analytical Engine ORG
nearly three CARDINAL
Notes PRODUCT
first ORDINAL
Ursula Martin PERSON
the University of Oxford ORG
Lovelace PERSON
Lovelace PERSON
less than a decade later DATE
Nov. 27, 1852 DATE
Notes PRODUCT
Claire Cain Miller PERSON
The Upshot ORG
first ORDINAL
Ada Lovelace PERSON

To extract just the named entities that have been identified as PERSON, we can add a simple if statement into the mix:

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
CLAIRE CAIN MILLER
Ada Lovelace
Jacquard
Charles Babbage’s
Analytical Engine
Lovelace
Bernoulli
Bernoulli
Jacob Bernoulli
Walter Isaacson
Lovelace
Betty Alexandra Toole
Math
Lovelace
Ada Lovelace
Lovelace
Lady Lovelace
Lovelace
Mary Somerville
Lovelace
William King
Lovelace
Lovelace
Augustus De Morgan
Lovelace
Ursula Martin
Lovelace
Lovelace
Claire Cain Miller
Ada Lovelace
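
The same filtering idea can be extended to every label at once. The sketch below groups entities by label in a dictionary. To stay self-contained, it works on a short list of (text, label) pairs copied by hand from the output above rather than on document.ents; with a real spaCy document you would loop over document.ents and read named_entity.text and named_entity.label_ instead. The names sample_entities and entities_by_label are illustrative, not part of spaCy.

```python
from collections import defaultdict

# A few (text, label) pairs copied by hand from the output above;
# they stand in for document.ents so this sketch runs without spaCy
sample_entities = [
    ("Ada Lovelace", "PERSON"),
    ("1843", "DATE"),
    ("British", "NORP"),
    ("London", "GPE"),
    ("Mary Somerville", "PERSON"),
    ("Dec. 10, 1815", "DATE"),
]

# Map each label to the list of entity texts that received it
entities_by_label = defaultdict(list)
for entity_text, label in sample_entities:
    entities_by_label[label].append(entity_text)

for label, entity_texts in entities_by_label.items():
    print(label, entity_texts)
```

Grouping this way makes it easy to look up any one category later, for example entities_by_label["GPE"] for places.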

NER with Long Texts or Many Texts#

For the rest of this lesson, we’re going to work with a much longer text: Louisa May Alcott’s novel Little Women. Because the novel is so long, we will split it into smaller chunks and process the chunks one after another.

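import math

filepath = "../texts/literature/Little-Women_Louisa-May-Alcott.txt"
# Read the whole novel into a single string
text = open(filepath, encoding='utf-8').read()

# Split the text into 80 roughly equal chunks of characters
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

# Process every chunk with spaCy and collect the resulting documents in a list
chunked_documents = list(nlp.pipe(text_chunks))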
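
To see what the chunking code does, here is the same logic applied to a short stand-in string, so the chunks are easy to inspect. The sample string and the choice of 4 chunks are assumptions made for illustration.

```python
import math

# A short stand-in string (26 characters) instead of the full novel
sample_text = "abcdefghijklmnopqrstuvwxyz"
number_of_chunks = 4

# Round up so that the chunks cover the whole text
chunk_size = math.ceil(len(sample_text) / number_of_chunks)

sample_chunks = []
for start in range(0, len(sample_text), chunk_size):
    sample_chunks.append(sample_text[start:start + chunk_size])

print(chunk_size)
print(sample_chunks)
```

Because the slices fall at fixed character positions, a chunk boundary can land in the middle of a word or a name, so an entity that straddles two chunks may be missed or split.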

Get People#

Type Label     Description
PERSON         People, including fictional.

To extract and count the people, we will use an if statement that will pull out words only if their “ent” label matches “PERSON.”

Pandas Review

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
character count
0 Jo 1256
1 Amy 645
2 Laurie 570
3 Beth 465
4 Meg 311
5 John 144
6 Hannah 122
7 Brooke 96
8 Laurence 85
9 Bhaer 77
10 Teddy 51
11 Fred 40
12 Project Gutenberg 40
13 Daisy 29
14 Kate 27
15 Demi 24
16 Moffat 22
17 Margaret 21
18 Davis 18
19 Ned 18
20 Flo 18
21 Aunt 15
22 Frank 15
23 Dashwood 15
24 Esther 13
25 Marmee 12
26 Sallie 12
27 Zara 11
28 Scott 11
29 Roderigo 10
30 Chester 10
31 Hagar 9
32 Annie 9
33 Jo\n 9
34 John Brooke 9
35 Kirke 9
36 Fritz 9
37 Tina 8
38 Bethy 7
39 Carrol 7
40 Don Pedro 6
41 Shakespeare 6
42 Gardiner 6
43 God 6
44 Joanna 6
45 Lotty 6
46 Bangs 6
47 K. 6
48 Alcott 6
49 Project\nGutenberg 6
50 Jo decidedly 5
51 Jenny 5
52 Pickwick 5
53 Miss Crocker 5
54 Hummel 5
55 March 5
56 Tudor 5
57 Kitty 5
58 Grandpa 4
59 Annie Moffat 4
60 Raphael 4
61 King 4
62 Clara 4
63 Miss Belle 4
64 Ned Moffat 4
65 Crocker 4
66 Grace 4
67 Mary 4
68 Ellen Tree 4
69 Down 4
70 JO 4
71 Norton 4
72 Gott 4
73 Friedrich 4
74 Gutenberg 4
75 Josephine 3
76 Belsham 3
77 Susie 3
78 Cutter 3
79 Snow 3
80 Annie Moffat's 3
81 George 3
82 Belle 3
83 Miss 3
84 Tupman 3
85 Snodgrass 3
86 Sallie Gardiner 3
87 Fred Vaughn 3
88 pell-mell 3
89 David 3
90 Jimmy 3
91 gravely,-- 3
92 Hush 3
93 Jo felt 3
94 Sallie Moffat 3
95 Randal 3
96 Grundy 3
97 Presently Jo 3
98 Jack 3
99 Lamb 3
100 Miss Lamb 3
101 Thou 3
102 Friedrich Bhaer 3
103 Project Gutenberg-tm 3
104 Elizabeth 2
105 Don Pedro's 2
106 Theodore 2
107 Dora 2
108 gratefully,-- 2
109 Jo\neagerly 2
110 JAMES LAURENCE 2
111 Jenny Snow 2
112 Edgeworth 2
113 Lincoln 2
114 Nan 2
115 Jo stoutly 2
116 Dickens 2
117 Samuel Pickwick 2
118 Tracy Tupman 2
119 Nathaniel Winkle 2
120 Antonio 2
121 Sam Weller 2
122 Longmeadow 2
123 bush 2
124 Bon 2
125 Uncle 2
126 Kitty Bryant's 2
127 Johnson 2
128 Jo\ncarried 2
129 Jo warmly 2
130 John\n 2
131 chasséed 2
132 Jove 2
133 Eliott 2
134 Cornelius 2
135 Dove 2
136 Demijohn 2
137 Aunt Carrol 2
138 May 2
139 Killarney 2
140 Kate Kearney 2
141 Mees Marsch 2
142 Mamma 2
143 Plato 2
144 chubby 2
145 Mozart 2
146 homesick 2
147 Minna 2
148 Rob 2
149 Ted 2
150 LULU 2
151 Betty 2
152 Louisa May Alcott 2
153 Charles Dickens 2
154 Lulu 2
155 Undine 1
156 Sintram 1
157 Faber 1
158 Operatic Tragedy 1
159 Hurry 1
160 Meg\nwarmly 1
161 Presently Beth 1
162 Die Engel-kinder 1
163 Santa Claus 1
164 on,--when 1
165 Jo! 1
166 Miss Josephine 1
167 Christopher 1
168 Quel 1
169 pantoufles jolis 1
170 Laurie\ngood-naturedly 1
171 Buzz 1
172 arnica 1
173 Kings 1
174 Florence 1
175 Maria\nParks 1
176 Belsham]\n\n 1
177 Ellen 1
178 Susie Perkins 1
179 Chloe 1
180 Tom 1
181 brown house 1
182 Theodore\nLaurence 1
183 Jo arm-in 1
184 Laurie\nmount guard 1
185 Beth\nhid 1
186 James Laurence' 1
187 Amy March 1
188 Katy Brown 1
189 Mary Kingsley 1
190 Miss Snow 1
191 Blimber 1
192 Hem 1
193 Jo\nappeared 1
194 JO MEETS APOLLYON 1
195 Jo\ncrossly 1
196 Jo\nforgot 1
197 Mrs M. 1
198 M. 1
199 Miss\nBelle 1
200 Cinderella 1
201 Fisher 1
202 Meg]\n\n 1
203 &c.]\n\n 1
204 Knights 1
205 Tis 1
206 Unmask 1
207 the P. C. 1
208 Snowball Pat 1
209 Snowball 1
210 Pickwick Hall 1
211 Hannah Brown 1
212 BETH BOUNCER 1
213 Avenger 1
214 S. P. 1
215 A. S. 1
216 T. T. 1
217 N. W. 1
218 bona fide 1
219 Winkle 1
220 martin-house 1
221 Weller 1
222 Jo\nregarded 1
223 The P. O. 1
224 Sairy Gamp 1
225 Katy\nBrown's 1
226 Flora McFlimsey 1
227 Boaz 1
228 Jo hurried 1
229 Croaker 1
230 Laurie wrote,-- 1
231 Kate Vaughn 1
232 Sunshine 1
233 Barker 1
234 Fred; Laurie took Sallie 1
235 Jo\nangrily 1
236 Miss\nMarch 1
237 Count Gustave 1
238 What's 1
239 Thankee 1
240 Bosen 1
241 Fred, Sallie 1
242 Laurie to Jo 1
243 John Bull 1
244 Jo\nnodded 1
245 Truth 1
246 Mary Stuart 1
247 Meg heartily 1
248 Grace of Amy 1
249 Fred and Kate 1
250 refrain,-- 1
251 Yes'm 1
252 La 1
253 Beth\nmeekly 1
254 Laurie heartily 1
255 Ashamed 1
256 Bent 1
257 billiard saloon 1
258 tipsy 1
259 Angelo 1
260 Miss Burney 1
261 Lady Something 1
262 MARCH 1
263 Thomas 1
264 bag,-- 1
265 Greatheart 1
266 Meggy 1
267 Laurie I 1
268 Kiss 1
269 MA 1
270 contradick 1
271 Chick 1
272 Hattie King 1
273 Jo doos 1
274 eatin sweet 1
275 Yours Respectful 1
276 Hannah Mullet 1
277 Quartermaster Mullett 1
278 Lion 1
279 TEDDY 1
280 Puck 1
281 Call Meg 1
282 baker 1
283 Madam 1
284 Estelle 1
285 Mademoiselle 1
286 Testament 1
287 Jo\nafterward 1
288 Allyluyer 1
289 Amy Curtis 1
290 Theodore Laurence 1
291 Kitty Bryant 1
292 Anni 1
293 ESTELLE 1
294 {\n 1
295 Laurie soberly 1
296 Sabbath 1
297 Peggy 1
298 Caroline Percy 1
299 Mother 1
300 Jo\nsoothingly 1
301 Jo seriously 1
302 Sam 1
303 Rambler 1
304 wrapper,--was 1
305 Queen Bess 1
306 Madonna 1
307 Child 1
308 bilin 1
309 Jo\nblundered 1
310 Cook 1
311 independent,--so 1
312 James Laurence 1
313 Book 1
314 Sister Jo 1
315 grace,--a 1
316 Toodles 1
317 John begin 1
318 Parker 1
319 Gummidge 1
320 Mark 1
321 lace 1
322 Grecian 1
323 Jupiter Ammon 1
324 Uncle Carrol 1
325 Bacchus 1
326 Juliet 1
327 Michael Angelo 1
328 Maria Theresa 1
329 fête 1
330 Literary Lessons]\n\n XXVII 1
331 a People's Course 1
332 S. L. A. N. G. Northbury 1
333 Belzoni 1
334 Aim 1
335 Allen 1
336 Martha 1
337 jell 1
338 Jack Scott 1
339 Mantalini 1
340 Ned Moffat's 1
341 John dryly 1
342 Shut 1
343 Uncle Teddy 1
344 John Laurence 1
345 Megs 1
346 XXIX 1
347 Shylock 1
348 Maud 1
349 satin 1
350 May Chester's 1
351 Tom Brown 1
352 Tommy Chamberlain 1
353 Tommy 1
354 ones slip 1
355 May Chester 1
356 Lambs 1
357 do,--took 1
358 Hayes 1
359 Miss Jo 1
360 Lady Bountiful 1
361 Shun 1
362 Lennox 1
363 Ward 1
364 Robert Lennox's 1
365 Aunt Mary 1
366 Route de Roi 1
367 Noah 1
368 Fechter 1
369 Frank Vaughn 1
370 Beth Frank 1
371 Vaughns 1
372 Lawrence 1
373 Hogarth 1
374 Fred and Frank 1
375 parley vooing 1
376 cafés 1
377 Marie Antoinette's 1
378 Charlemagne 1
379 bijouterie 1
380 Bois 1
381 Chaise 1
382 Berne 1
383 Ariadne 1
384 showy 1
385 Blöndchen 1
386 Neckar 1
387 hands,--and 1
388 Jo said,-- 1
389 Cock 1
390 Bonnie Dundee 1
391 JO'S 1
392 Mabel 1
393 Thou shalt 1
394 Handsome 1
395 Lager Beer 1
396 Ursa Major 1
397 hose 1
398 Lucifer 1
399 nargerie_ 1
400 "P. S. On 1
401 Franz 1
402 L. 1
403 homey 1
404 Milton 1
405 Malaprop 1
406 Nick Bottom 1
407 Espagne 1
408 Jacks 1
409 compensation-- 1
410 Northbury 1
411 grammar 1
412 désillusionée_ 1
413 Kant 1
414 Hegel 1
415 Schiller 1
416 irresistible,-- 1
417 Jo stuffed 1
418 Sherwood 1
419 Sabbath-school 1
420 Hail 1
421 BETH 1
422 Jo saw\n 1
423 rebuke Jo 1
424 me,--busy 1
425 Victor Emmanuel 1
426 buff 1
427 Place Napoleon 1
428 Que 1
429 blasé_ 1
430 Avigdor 1
431 Corso 1
432 Schubert 1
433 mouchoir 1
434 Junoesque 1
435 Diana 1
436 Serene Something 1
437 Rothschild 1
438 Lady de Jones 1
439 satin train 1
440 Serene Teuton 1
441 Vladimir 1
442 Balzac 1
443 tulle 1
444 Daisy]\n\n 1
445 Vive la 1
446 Babyland 1
447 Englishman 1
448 Baby 1
449 Mornin 1
450 beseechingly,-- 1
451 John kissed 1
452 Brookes 1
453 Sallie Moffatt 1
454 Saxon 1
455 Laurie one 1
456 Meek 1
457 Alps 1
458 Dolce 1
459 Raphaella 1
460 Jouvin 1
461 mon 1
462 Au 1
463 glad 1
464 Aunty Beth 1
465 Jo\n_ 1
466 Requiem 1
467 Beethoven 1
468 Bach 1
469 Vevay 1
470 La Tour 1
471 Lausanne 1
472 Rousseau 1
473 tableau 1
474 Providence 1
475 Jo\nglanced 1
476 t.\n 1
477 Aunt Priscilla 1
478 Jo exclaim,-- 1
479 ab\nlibitum_ 1
480 Jo hung 1
481 Aristotle 1
482 Aunt Amy 1
483 Aunt Beth 1
484 Jo\nneglected 1
485 Jo some 1
486 Fessor 1
487 Hoffmann 1
488 Jo hastily 1
489 Jo\nbashfully 1
490 Catherine 1
491 Jo\ndecidedly 1
492 menagerie 1
493 Mother Bhaer 1
494 Bhaers 1
495 Cowley 1
496 Columella 1
497 Grandma 1
498 Unlucky Jo' 1
499 Amy warmly 1
500 Louisa M. Alcott's 1
501 Writings 1
502 Reginald B. Birch 1
503 Alice Barber Stephens 1
504 Jessie Willcox Smith 1
505 Harriet Roosevelt Richards 1
506 Girls.= 1
507 JO'S SCRAP-BAG 1
508 Amy\n\nIllustrated 1
509 Rose 1
510 Ben 1
511 Bob 1
512 COMIC 1
513 Foreword 1
514 Ednah D. Cheney 1
515 Sol Eytinge 1
516 Marjorie 1
517 Ethel 1
518 ASTOR 1
519 Aunt Wee 1
520 Elizabeth L. Gould 1
521 Little Women 1
522 underscores 1
523 Webster 1
524 Charles\nDickens 1
525 Augustus Snodgrass 1
526 Samuel Weller 1
527 Betsey 1
528 Tarantula 1
529 David Copperfield 1
530 A. M. Barnard 1
531 Jo's Scrap-Bag 1
532 Louisa M. Alcott 1
533 Michael\nHart 1
534 Project Gutenberg-tm's 1
535 S.\nFairbanks 1
536 Gregory B. Newby 1
537 Michael S. Hart 1

Get Places#

| Type Label | Description |
| --- | --- |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |

To extract and count places, we can follow the same model as above, except we will change our if statement to check for entity labels that match “GPE” or “LOC.” These are the type labels for “countries, cities, states” and “non-GPE locations, mountain ranges, bodies of water.”

places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
place count
0 Washington 13
1 Nice 12
2 Paris 10
3 Belle 9
4 china 8
5 Rome 8
6 Tina 8
7 Demi 8
8 America 7
9 Plumfield 7
10 London 7
11 Bhaer 7
12 the United States 7
13 Hummels 6
14 Switzerland 6
15 Laurie 6
16 Germany 6
17 Vevay 5
18 Hum 5
19 Celestial City 4
20 Tudor 4
21 Valley 3
22 Sancho 3
23 Italy 3
24 France 3
25 New York 3
26 Berlin 3
27 Paradise 3
28 U.S. 3
29 Europe 2
30 quaver 2
31 Egypt 2
32 Satan 2
33 Jupiter 2
34 India 2
35 earth 2
36 Lottchen 2
37 Minna 2
38 A.M. 2
39 Baden-Baden 2
40 Scotts 2
41 1.E.8 2
42 the United\nStates 2
43 China 1
44 South 1
45 the City of Destruction 1
46 the Slough of Despond 1
47 Asia 1
48 Africa 1
49 Columbus 1
50 maroon 1
51 IV 1
52 Vicar 1
53 Kings 1
54 the Diamond Lake 1
55 Bremer 1
56 Moffats 1
57 Chiny 1
58 Bacon 1
59 Milton 1
60 capital,--so 1
61 Shadowy 1
62 Canada 1
63 Sahara 1
64 Atalanta 1
65 Breakfast 1
66 Pewmonia 1
67 Rappahannock 1
68 Heinrich 1
69 Mentor 1
70 Thou 1
71 us 1
72 Hercules 1
73 Romeo 1
74 Lisbon 1
75 Spiritualism 1
76 Jo 1
77 LONDON 1
78 Liverpool 1
79 Briton 1
80 Yankee 1
81 Kenilworth 1
82 Regent Street 1
83 Hyde Park 1
84 Wellington 1
85 Punch 1
86 Hampton 1
87 Richmond Park 1
88 Saint Denis 1
89 the Tuileries Gardens 1
90 Rhine 1
91 Bonn 1
92 Nassau 1
93 Byronic 1
94 NEW YORK 1
95 Corinne 1
96 Kirke 1
97 quarrel,--we 1
98 Sandwich 1
99 the Jardin Publique 1
100 Chauvain 1
101 Greece 1
102 Tarlatan 1
103 Cardiglia 1
104 Tarantula 1
105 Davises 1
106 india 1
107 character,--we 1
108 Babydom 1
109 Monaco 1
110 Baptiste 1
111 Mediterranean 1
112 Requiem 1
113 Vienna 1
114 Mendelssohn 1
115 St. Gingolf 1
116 Clarens 1
117 Weathercock 1
118 St. Martin 1
119 marmar 1
120 Truly 1
121 Hamburg 1
122 Professorin 1
123 West 1
124 Pomonas 1
125 Garland 1
126 WASHINGTON 1
127 BOSTON 1
128 New\n England 1
129 New Hampshire 1
130 Washington St. 1
131 Boston 1
132 Mass. 1
133 Passages 1
134 N. Winkle's 1
135 Fairfield 1
136 a United States 1
137 Replacement 1
138 Mississippi 1
139 Salt Lake City 1
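
Several rows in the tally above are the same place written differently: “the United States” and “the United\nStates” differ only by a line break, and “Washington” and “WASHINGTON” only by capitalization. One possible cleanup, sketched below, is to normalize each entity's text before counting it. The helper names `normalize_entity_text` and `tally_normalized` are hypothetical (they are not part of spaCy), and the sample list simply reuses a few strings from the output above. Whether to merge capitalization variants is a judgment call, since doing so would also merge “china” (the porcelain) with “China.”

```python
from collections import Counter


def normalize_entity_text(text, ignore_case=False):
    """Collapse line breaks and repeated spaces; optionally lowercase."""
    cleaned = " ".join(text.split())
    return cleaned.lower() if ignore_case else cleaned


def tally_normalized(entity_texts, ignore_case=False):
    """Count entity strings after normalizing them."""
    return Counter(normalize_entity_text(text, ignore_case) for text in entity_texts)


# A few strings copied from the output above, for illustration only
sample_places = ["the United States", "the United\nStates", "Washington", "WASHINGTON", "New\n England"]

whitespace_tally = tally_normalized(sample_places)
caseless_tally = tally_normalized(sample_places, ignore_case=True)
```

In the lesson's loop, this would amount to appending `normalize_entity_text(named_entity.text)` instead of `named_entity.text`.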

Get Streets & Parks#

| Type Label | Description |
| --- | --- |
| FAC | Buildings, airports, highways, bridges, etc. |

To extract and count streets and parks, we can follow the same model as above, except we will change our if statement to check for entity labels that match “FAC.” This is the type label for “buildings, airports, highways, bridges, etc.”

streets = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "FAC":
            streets.append(named_entity.text)

streets_tally = Counter(streets)

df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df
street count
0 Pickwick Hall 1
1 the Earl of Devereux 1
2 the Tower of Babel 1
3 the Barnville Theatre 1
4 Loved 1
5 Camp Laurence 1
6 the moon 1
7 the\ngate 1
8 Sphinx 1
9 the Bath Hotel 1
10 the Rue de Rivoli 1
11 Castle Hill 1
12 Difficulty 1
13 Camp 1
14 Page 123 1
15 Page 124 1
16 Page 411 1
17 Page 413 1
18 The Children's Friend Series 1
19 the Project Gutenberg™ License 1

Get Works of Art#

| Type Label | Description |
| --- | --- |
| WORK_OF_ART | Titles of books, songs, etc. |

To extract and count works of art, we can follow the same model as the examples above, changing our if statement to match the entity label “WORK_OF_ART.”

works_of_art = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "WORK_OF_ART":
            works_of_art.append(named_entity.text)

art_tally = Counter(works_of_art)

df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df
work_of_art count
0 Meg 4
1 Merry Christmas 3
2 Aunt March 3
3 Dear me 3
4 the List of Illustrations 3
5 the "Heir of Redclyffe 2
6 Christopher Columbus 2
7 the "Busy Bee Society 2
8 Hamlet 2
9 O Laurie 2
10 Come, Jo 2
11 O Teddy 2
12 Love 2
13 Aunt Dodo 2
14 Little Men 2
15 Eight Cousins 2
16 Plain Vanilla ASCII 2
17 Little Tranquillity 1
18 Laurie Laurence,--what 1
19 Belsham's Essays by the hour 1
20 Vicar of Wakefield' 1
21 'Yes, ma'am 1
22 the "Mansion of Bliss 1
23 Dear Madam_,-- 1
24 Young ladies, you remember 1
25 Is Laurie an 1
26 Keep 1
27 Daisy March 1
28 Dear me! 1
29 Shall I go away 1
30 Aunt\nCockle-top 1
31 The Pickwick Portfolio 1
32 Poet's Corner 1
33 Lecture on "WOMAN AND HER POSITION 1
34 The Wide, Wide World 1
35 Scarlet 1
36 'The Sea-Lion 1
37 Dancing and French 1
38 Young ladies in America 1
39 Dear, how charming! 1
40 Illustration: Swinging to and fro 1
41 Please 1
42 Spirits 1
43 Delectable Mountain 1
44 Hurrah for Miss March 1
45 Spread Eagles 1
46 The Rival Painters 1
47 Won't Laurie laugh 1
48 Where's Laurie 1
49 'Take it 1
50 'Hope 1
51 A SONG FROM THE SUDS 1
52 Yours Respectful 1
53 "Water 1
54 About Meg 1
55 Stop, Jo 1
56 O Meg 1
57 Where is Laurie 1
58 Hanged if I do! 1
59 Hang the 'Rambler 1
60 So am I 1
61 Shall I tell you how 1
62 The Spread Eagle 1
63 Mother and I 1
64 Mother and I are going to wait for John 1
65 Run, Beth, 1
66 Spread Eagle 1
67 Send Beth 1
68 Curse of the Coventrys 1
69 Receipt\nBook 1
70 Daisy and Demi,--just the thing 1
71 Come, Jo, 1
72 Jo March 1
73 Yes, Amy was 1
74 Speak for yourself 1
75 Teddy 1
76 Didn't Hayes 1
77 Everything of Amy's 1
78 Teddy's Own 1
79 Aunt Carrol 1
80 Aunt and Flo 1
81 'The Flirtations of Capt 1
82 Rotten Row means ' 1
83 The Palais Royale 1
84 Olympia's Oath 1
85 Ah, Jo 1
86 'Friend of the old 1
87 Constant\n Tin Soldier 1
88 A Happy New Year 1
89 Shall I tell my friend 1
90 own,--a 1
91 Mees Marsch 1
92 Weekly Volcano 1
93 Demon of the Jura 1
94 Bless 1
95 Teddy, dear 1
96 Dear little bird 1
97 Wish I was 1
98 Lazy Laurence 1
99 Saint Laurence 1
100 Rarey with Puck 1
101 O Laurie, Laurie 1
102 Yes, Laurie 1
103 the church of\none 1
104 Dear Jo 1
105 Dear old f 1
106 Miss Marsch 1
107 IN THE GARRET 1
108 Mother Bhaer 1
109 Boys and How They Turned Out.= A Sequel 1
110 The Little Women Series 1
111 Roses and Forget-me-nots 1
112 Baa! Baa 1
113 "How They Ran Away 1
114 Water-Lilies 1
115 Shadow-Children 1
116 The Moss People 1
117 Little Women 1
118 Transcriber's 1
119 The Pickwick Papers 1
120 Library 1
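
The three extraction loops above differ only in the labels they check and the column name they use, so they could be folded into one reusable function. The sketch below is one way to do that. `count_entities` is a hypothetical helper name, and it assumes only that each document exposes `.ents` whose items have `.label_` and `.text`, as spaCy documents do. The stand-in objects at the bottom exist only so the example runs without loading a model.

```python
from collections import Counter
from types import SimpleNamespace


def count_entities(documents, labels):
    """Tally the text of every entity whose label is in `labels`."""
    wanted = set(labels)
    return Counter(
        named_entity.text
        for document in documents
        for named_entity in document.ents
        if named_entity.label_ in wanted
    )


# Stand-ins for spaCy documents (illustration only)
fake_documents = [
    SimpleNamespace(ents=[
        SimpleNamespace(text="Paris", label_="GPE"),
        SimpleNamespace(text="Hamlet", label_="WORK_OF_ART"),
    ]),
    SimpleNamespace(ents=[
        SimpleNamespace(text="Paris", label_="GPE"),
        SimpleNamespace(text="Rhine", label_="LOC"),
    ]),
]

places_tally = count_entities(fake_documents, ["GPE", "LOC"])
art_tally = count_entities(fake_documents, ["WORK_OF_ART"])
```

With the real data, `count_entities(chunked_documents, ["WORK_OF_ART"]).most_common()` could be passed to `pd.DataFrame` in the same way as in the cells above.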

Get NER in Context#

from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    """Display each sentence in which a named entity contains the keyword."""
    if not desired_ner_labels:
        # Default to every label the NER component can assign
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity text (ignoring capitalization)
            # and whether the entity has one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces in both the sentence and the entity text
                sentence_text = re.sub('\n', ' ', sentence.text)
                entity_text = re.sub('\n', ' ', named_entity.text)
                # Bold the entity, escaping it so punctuation is not read as regex syntax
                sentence_text = re.sub(re.escape(entity_text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

for document in chunked_documents:
    get_ner_in_context('Jupiter', document)
---

LOC

By Jupiter I will, if I only get the chance!” cried Laurie, sitting up with sudden energy.

---

PERSON

A crash, a cry, and a laugh from Laurie, accompanied by the indecorous exclamation, “Jupiter Ammon!

---

LOC

“Twins, by Jupiter!” was all he said for a minute; then, turning to the women with an appealing look that was comically piteous, he added, “Take ‘em quick, somebody!
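
Entity text often contains punctuation that has a special meaning in regular expressions (the tallies above include strings such as “A.M.” and “S. P.”), so passing it to `re.sub` unescaped can bold the wrong span, or fail outright for characters such as an unmatched parenthesis. The sketch below isolates the highlighting step in a hypothetical helper, `bold_entity`, which escapes the entity text with `re.escape` and keeps the sentence's own capitalization. The sample sentence is made up for illustration.

```python
import re


def bold_entity(sentence_text, entity_text):
    """Wrap each occurrence of entity_text in Markdown bold markers."""
    # re.escape makes characters like "." and "{" match literally
    pattern = re.escape(entity_text)
    # Reuse the matched text so the sentence's capitalization is preserved
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)


highlighted = bold_entity("At 9 A.M. the AxMy club met.", "A.M.")
```

Without the escaping step, the pattern `A.M.` would also match “AxMy”, because an unescaped `.` matches any character.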

Your Turn!#

Now it’s your turn to take a crack at NER with a whole new text!

| Type Label | Description |
| --- | --- |
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including “%”. |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |

In this section, you’re going to extract and count named entities from The Autobiography of Benjamin Franklin.

Open and read the text file

filepath = "../texts/literature/The-Autobiography-of-Benjamin-Franklin.txt"
text = open(filepath, encoding='utf-8').read()

To process the book in smaller chunks (if working in Binder or on a computer with memory constraints):

chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))

To process the book all at once (if working on a computer with a larger amount of memory):

document = nlp(text)
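
For the chunked route, splitting on every `\n` produces one chunk per line, including many empty strings, and it can cut a name in half when the source file wraps lines mid-sentence. An alternative, sketched here under the assumption that paragraphs in the text file are separated by blank lines, is to split on blank lines and rejoin wrapped lines before handing the chunks to `nlp.pipe`. The function name `chunk_by_paragraph` is hypothetical, and the sample text is made up for illustration.

```python
import re


def chunk_by_paragraph(text):
    """Split text on blank lines and rejoin wrapped lines within each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text)
    # " ".join(p.split()) collapses line breaks and repeated spaces
    cleaned = (" ".join(paragraph.split()) for paragraph in paragraphs)
    # Drop chunks that were only whitespace
    return [paragraph for paragraph in cleaned if paragraph]


sample_text = "First line\nwraps here.\n\n\nSecond paragraph.\n"
chunks = chunk_by_paragraph(sample_text)
```

The result could then be passed on as `chunked_documents = list(nlp.pipe(chunk_by_paragraph(text)))`.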

1. Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you’ve chosen) in the book. If you need help, study the examples above.

2. What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?

3. What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?

4. What’s an insight that you might be able to glean about the book based on your NER extraction?