Named Entity Recognition

In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.


Dataset

Ada Lovelace’s Obituary & Louisa May Alcott’s Little Women

A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843.

—Claire Cain Miller, “Ada Lovelace,” New York Times Overlooked Obituaries

Here’s a preview of spaCy’s NER tagging of Ada Lovelace’s obituary:


A gifted mathematician who is now recognized as the first ORDINAL computer programmer.By CLAIRE ORG CAIN MILLER

A century DATE before the dawn of the computer age, Ada Lovelace ORG imagined the modern-day DATE , general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard NORP loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage PERSON ’s Analytical Engine ORG , was never built. But her writings about computing have earned Lovelace ORG — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.

[…]

Why is NER Useful?

Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we’ll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we’ll also do in a later lesson!).

Natural Language Processing (NLP)

Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). What is NLP, exactly? NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. At its heart, NLP uses statistics and computation to understand human language. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google Translate, chatbots, and Siri? These are all examples of NLP in action!

Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts are now getting eerily good. (If you don’t believe me, check out this app that will autocomplete your sentences with GPT-2, a state-of-the-art text generation model. When I ran it, the model generated a mini-lecture from a “university professor” that sounds spookily close to home…)

[Screenshot: GPT-2 autocompleting a prompt into a mini-lecture]

Open-source NLP tools are getting very good, too. We’re going to use one of these open-source tools, the Python library spaCy, for our Named Entity Recognition tasks in this lesson.

How spaCy Works

The tagged text above shows spaCy identifying named entities in Ada Lovelace’s New York Times obituary (something that we’ll test out for ourselves below). How does spaCy know that “Charles Babbage” is a person and that “1843” is a date?

Well, spaCy doesn’t know, not for sure anyway. Instead, spaCy is making a very educated guess. This “guess” is based on what spaCy has learned about the English language after seeing lots of other examples.

That’s a colloquial way of saying: spaCy relies on machine learning models that were trained on a large amount of carefully labeled texts. (These texts were, in fact, often labeled and corrected by hand.) This is similar to our topic modeling work from the previous lesson, except our topic model wasn’t using labeled data.

The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. (Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.)

When spaCy identifies people and places in Ada Lovelace’s obituary, in other words, the NLP model is actually making predictions about the text based on what it has learned about how people and places function in English-language sentences.

NER with spaCy

Install spaCy

!pip install -U spacy

Import Libraries

We’re going to import spacy and displacy, a special spaCy module for visualization.

import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We’re also going to import the Counter module for counting people, places, and things, and the pandas library for organizing and displaying data (we’re also changing the pandas default max row and column width display setting).
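To preview how Counter works before we use it on real entities, here’s a toy tally (the names here are just placeholders, not output from spaCy):

```python
from collections import Counter

# Tally how many times each item appears in a list
tally = Counter(["Ada", "Babbage", "Ada", "Somerville", "Ada"])

# most_common() returns (item, count) pairs, highest count first
print(tally.most_common())
# → [('Ada', 3), ('Babbage', 1), ('Somerville', 1)]
```

Later in the lesson, we’ll feed Counter a list of entity names and turn the resulting tally into a pandas DataFrame.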

Download Language Model

Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

!python -m spacy download en_core_web_sm

Note: spaCy offers models for other languages including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean and Vietnamese don’t currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages via external dependencies — such as KoNLPy for Korean, Pyvi for Vietnamese, or Jieba for Chinese.

Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

import en_core_web_sm
nlp = en_core_web_sm.load()

2. We can load the model by name.

nlp = spacy.load('en_core_web_sm')

If you’ve just downloaded the model for the first time, it’s advisable to use Option 1, which lets you use the model immediately. With Option 2, you’ll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).

Process Document

We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in a single line of code: document = nlp(text).

After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read Ada Lovelace’s obituary. Then we run nlp() on the text and create our document.

filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

spaCy Named Entities

Below is a Named Entities chart taken from spaCy’s website, which shows the different named entities that spaCy can identify as well as their corresponding type labels.

Type Label     Description
PERSON         People, including fictional.
NORP           Nationalities or religious or political groups.
FAC            Buildings, airports, highways, bridges, etc.
ORG            Companies, agencies, institutions, etc.
GPE            Countries, cities, states.
LOC            Non-GPE locations, mountain ranges, bodies of water.
PRODUCT        Objects, vehicles, foods, etc. (Not services.)
EVENT          Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART    Titles of books, songs, etc.
LAW            Named documents made into laws.
LANGUAGE       Any named language.
DATE           Absolute or relative dates or periods.
TIME           Times smaller than a day.
PERCENT        Percentage, including "%".
MONEY          Monetary values, including unit.
QUANTITY       Measurements, as of weight or distance.
ORDINAL        "first", "second", etc.
CARDINAL       Numerals that do not fall under another type.

To quickly see spaCy’s NER in action, we can use the spaCy module displacy with the style= parameter set to “ent” (short for entities):

displacy.render(document, style="ent")
A gifted mathematician who is now recognized as the first ORDINAL computer programmer.By CLAIRE ORG CAIN MILLER

A century DATE before the dawn of the computer age, Ada Lovelace ORG imagined the modern-day DATE , general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard NORP loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage PERSON ’s Analytical Engine ORG , was never built. But her writings about computing have earned Lovelace ORG — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.

The program she wrote for the Analytical Engine ORG was to calculate the seventh ORDINAL Bernoulli ORG number. ( Bernoulli ORG numbers, named after the Swiss NORP mathematician Jacob Bernoulli PERSON , are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculating numbers, she said, to understand symbols and be used to create music or art.

“This insight would become the core concept of the digital age,” Walter Isaacson PERSON wrote in his book “The Innovators WORK_OF_ART .” “ Any piece of content WORK_OF_ART , data or information — music, text, pictures, numbers, symbols, sounds, video — could be expressed in digital form and manipulated by machines.” She also explored the ramifications of what a computer could do, writing about the responsibility placed on the person programming the machine, and raising and then dismissing the notion that computers could someday think and create on their own — what we now call artificial intelligence.

The Analytical Engine WORK_OF_ART has no pretensions whatever to originate any thing,” she wrote. “It can do whatever we know how to order it to perform.”

Lovelace, a British NORP socialite who was the daughter of Lord Byron PERSON , the Romantic ORG poet, had a gift for combining art and science, one of her biographers, Betty Alexandra Toole PERSON , has written. She thought of math and logic as creative and imaginative, and called it “poetical science.”

Math “constitutes the language through which alone we can adequately express the great facts of the natural world,” Lovelace PERSON wrote.

Her work, which was rediscovered in the mid-20th century DATE , inspired the Defense Department ORG to name a programming language after her and each October DATE Ada Lovelace Day signifies a celebration of women in technology. Lovelace lived when women were not considered to be prominent scientific thinkers, and her skills were often described as masculine.

“With an understanding thoroughly masculine in solidity, grasp and firmness, Lady Lovelace PERSON had all the delicacies of the most refined female character,” said an obituary in The London Examiner ORG .

Babbage, who called her the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences ORG and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”

Augusta Ada Byron PERSON was born on Dec. 10, 1815 DATE , in London GPE , to Lord Byron PERSON and Annabella Milbanke PERSON . Her parents separated when she was an infant, and her father died when she was 8 DATE . Her mother — whom Lord Byron PERSON called the “princess of parallelograms” and, after their falling out, a “mathematical Medea PERSON ” — was a social reformer from a wealthy family who had a deep interest in mathematics.

An etching from a portrait of Lovelace as a child. She is said to have had a gift for combining art and science. Smith Collection/Gado/ ORG Getty Images

Lovelace showed a passion for math and mechanics from a young age, encouraged by her mother. Because of her class, she had access to private tutors and to intellectuals in British NORP scientific and literary society. She was insatiably curious and surrounded herself with big thinkers of the day DATE , including Mary Somerville PERSON , a scientist and writer.

It was Somerville PERSON who introduced Lovelace PERSON to Babbage PERSON when she was 17 DATE , at a salon he hosted soon after she made her society debut. He showed her a two-foot QUANTITY high, brass mechanical calculator he had built, and it gripped her imagination. They began a correspondence about math and science that lasted almost two decades DATE .

She also met her husband, William King PERSON , through Somerville PERSON . They married in 1835 DATE , when she was 19 DATE . He soon became an earl, and she became the Countess of Lovelace WORK_OF_ART . By 1839 DATE , she had given birth to two CARDINAL sons and a daughter.

She was determined, however, not to let her family life slow her work. The year she was married, she wrote to Somerville PERSON : “I now read Mathematics PERSON every day DATE and am occupied in Trigonometry GPE and in preliminaries to Cubic and Biquadratic Equations ORG . So you see that matrimony has by no means lessened my taste for these pursuits, nor my determination to carry them on.”

In 1840 DATE , Lovelace ORG asked Augustus De Morgan PERSON , a math professor in London GPE , to tutor her. Through exchanging letters, he taught her university-level math. He later wrote to her mother that if a young male student had shown her skill, “they would have certainly made him an original mathematical investigator, perhaps of first ORDINAL -rate eminence.”

It was in 1843 DATE , when she was 27 CARDINAL , that Lovelace PERSON wrote her most lasting contribution to computer science.

She published her translation of an academic paper about the Babbage Analytical Engine ORG and added a section, nearly three CARDINAL times the length of the paper, titled, “ Notes WORK_OF_ART .” Here, she described how the computer would work, imagined its potential and wrote the first ORDINAL program.

Researchers have come to see it as “an extraordinary document,” said Ursula Martin PERSON , a computer scientist at the University of Oxford ORG who has studied Lovelace ORG ’s life and work. “She’s talking about the abstract principles of computation, how you could program it, and big ideas like maybe it could compose music, maybe it could think.”

Lovelace died less than a decade later DATE , on Nov. 27, 1852 DATE . In the “ Notes PRODUCT ,” she imagined a future in which computers could do more powerful and faster analysis than humans.

“A new, a vast and a powerful language is developed for the future use of analysis,” she wrote, “in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind.”

Claire Cain Miller PERSON writes about gender for The Upshot WORK_OF_ART . She first ORDINAL learned about Ada Lovelace ORG while covering the tech industry, where women are severely underrepresented.

From a quick glance at the text above, we can see that spaCy is doing quite well with NER. But it’s definitely not perfect.

Though spaCy correctly identifies “Charles Babbage” as a PERSON, it labels “Ada Lovelace” as an ORG in the opening paragraph and “the Countess of Lovelace” as a WORK_OF_ART. Though spaCy correctly identifies “London” as a place (GPE), it labels “Jacquard” as a NORP — a nationality or group — when really “Jacquard” refers to a type of loom, named after the inventor Joseph Marie Jacquard.

This inconsistency is very important to note and keep in mind. If we wanted to use spaCy’s NER for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn’t be perfect. That’s why understanding the limitations of this tool is so crucial. While spaCy’s NER can be very good for identifying entities in broad strokes, it can’t be relied upon for anything exact and fine-grained — not out of the box anyway.
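To make “manual correction and cleaning” concrete, here’s a minimal sketch of a post-processing pass, assuming you’ve gathered the model’s predictions as (text, label) pairs. The override dictionary is hypothetical — you would build it by hand for your own corpus:

```python
# Hand-built overrides for entities the model gets wrong
# (hypothetical examples drawn from the obituary above)
label_overrides = {
    "Ada Lovelace": "PERSON",
    "Lovelace": "PERSON",
    "Jacquard": "PRODUCT",  # a type of loom, not a place or group
}

def correct_entities(entities):
    """Swap in a hand-corrected label whenever the entity text has a known override."""
    return [(text, label_overrides.get(text, label)) for text, label in entities]

predicted = [("Ada Lovelace", "ORG"), ("1843", "DATE"), ("Jacquard", "NORP")]
print(correct_entities(predicted))
# → [('Ada Lovelace', 'PERSON'), ('1843', 'DATE'), ('Jacquard', 'PRODUCT')]
```

A lookup table like this only fixes errors you’ve already spotted, which is exactly why spaCy’s out-of-the-box output shouldn’t be trusted for fine-grained work.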

Get Named Entities

All the named entities in our document can be found in the document.ents property. If we check out document.ents, we can see all the entities from Ada Lovelace’s obituary.

document.ents
(first,
 CLAIRE,
 A century,
 Ada Lovelace,
 the modern-day,
 1843,
 Jacquard,
 British,
 Charles Babbage,
 Analytical Engine,
 Lovelace,
 1852,
 36,
 first,
 the Analytical Engine,
 seventh,
 Bernoulli,
 Bernoulli,
 Swiss,
 Jacob Bernoulli,
 Walter Isaacson,
 “The Innovators,
 Any piece of content,
 The Analytical Engine,
 British,
 Lord Byron,
 Romantic,
 Betty Alexandra Toole,
 Lovelace,
 the mid-20th century,
 the Defense Department,
 October,
 Lady Lovelace,
 The London Examiner,
 Sciences,
 Augusta Ada Byron,
 Dec. 10, 1815,
 London,
 Byron,
 Annabella Milbanke,
 8,
 Lord Byron,
 Medea,
 Smith Collection/Gado/,
 British,
 the day,
 Mary Somerville,
 Somerville,
 Lovelace,
 Babbage,
 17,
 two-foot,
 almost two decades,
 William King,
 Somerville,
 1835,
 19,
 the Countess of Lovelace,
 1839,
 two,
 Somerville,
 Mathematics,
 every day,
 Trigonometry,
 Cubic and Biquadratic Equations,
 1840,
 Lovelace,
 Augustus De Morgan,
 London,
 first,
 1843,
 27,
 Lovelace,
 Babbage Analytical Engine,
 nearly three,
 Notes,
 first,
 Ursula Martin,
 the University of Oxford,
 Lovelace,
 less than a decade later,
 Nov. 27, 1852,
 Notes,
 Claire Cain Miller,
 The Upshot,
 first,
 Ada Lovelace)

Each of the named entities in document.ents contains more information about itself, which we can access by iterating through the document.ents with a simple for loop.

For each named_entity in document.ents, we will extract the named_entity and its corresponding named_entity.label_.

for named_entity in document.ents:
    print(named_entity, named_entity.label_)

To extract just the named entities that have been identified as PERSON, we can add a simple if statement into the mix:

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
Charles Babbage
Jacob Bernoulli
Walter Isaacson
Lord Byron
Betty Alexandra Toole
Lovelace
Lady Lovelace
Augusta Ada Byron
Byron
Annabella Milbanke
Lord Byron
Medea
Mary Somerville
Somerville
Lovelace
Babbage
William King
Somerville
Somerville
Mathematics
Augustus De Morgan
Lovelace
Ursula Martin
Claire Cain Miller

NER with Long Texts or Many Texts

For the rest of this lesson, we’re going to work with Louisa May Alcott’s novel Little Women.

Because Little Women is a long novel, we’ll break the text into 80 smaller chunks and process them with nlp.pipe(), which is faster and more memory-efficient than running nlp() on the entire text at once.

import math

filepath = "../texts/literature/Little-Women.txt"
text = open(filepath, encoding='utf-8').read()

# Split the text into 80 roughly equal-sized chunks
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

# Process all the chunks with the NLP model
chunked_documents = list(nlp.pipe(text_chunks))
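The chunking arithmetic is easy to sanity-check on a short stand-in string: splitting into chunks of math.ceil(len(text) / number_of_chunks) characters and re-joining the chunks should give back the original text.

```python
import math

text = "abcdefghij"  # stand-in for the novel's full text
number_of_chunks = 3

# ceil(10 / 3) = 4 characters per chunk
chunk_size = math.ceil(len(text) / number_of_chunks)

# Slice the text into consecutive chunks
text_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(text_chunks)                    # → ['abcd', 'efgh', 'ij']
print("".join(text_chunks) == text)   # → True
```

Note that the last chunk may be shorter than the rest; no characters are lost, so every named entity in the novel still appears in exactly one chunk (except for the rare entity that straddles a chunk boundary).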

Get People

Type Label     Description
PERSON         People, including fictional.

To extract and count the people, we will use an if statement that pulls out a named entity only if its label (accessed with .label_) matches “PERSON.”

people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
character count
0 Jo 1354
1 Laurie 581
2 Amy 580
3 Meg 545
4 Beth 460
5 John 145
6 Hannah 112
7 Brooke 100
8 Laurence 98
9 Bhaer 83
10 Teddy 61
11 Demi 47
12 Fred 44
13 Sallie 41
14 Kate 28
15 Daisy 25
16 Moffat 21
17 Margaret 20
18 Ned 19
19 Dashwood 18
20 Davis 17
21 Belle 16
22 Frank 16
23 Scott 14
24 Tina 13
25 Kirke 12
26 Zara 11
27 Fritz 10
28 Crocker 9
29 John Brooke 9
30 Tudor 9
31 papa 8
32 Grace 8
33 March 8
34 Chester 8
35 Roderigo 7
36 Aunt 7
37 Carrol 7
38 Alcott 7
39 Don Pedro 6
40 Shakespeare 6
41 Hagar 6
42 Pip 6
43 Hummel 6
44 Bangs 6
45 Esther 6
46 Lamb 6
47 Aunt Carrol 6
48 K. 6
49 Gardiner 5
50 Hush 5
51 Joanna 5
52 Jenny 5
53 Pickwick 5
54 Snodgrass 5
55 Lotty 5
56 Thou 5
57 Kitty 5
58 Belsham 4
59 King 4
60 Mademoiselle 4
61 Ned Moffat 4
62 Bethy 4
63 Mary 4
64 Ellen Tree 4
65 _ 4
66 Franz 4
67 Norton 4
68 Friedrich 4
69 Josephine 3
70 Gott 3
71 Susie 3
72 Cutter 3
73 Snow 3
74 George 3
75 Clara 3
76 Hortense 3
77 Brown 3
78 Sallie Gardiner 3
79 Fred Vaughn 3
80 David 3
81 Jimmy 3
82 Minna 3
83 Sallie Moffat 3
84 Randal 3
85 Parker 3
86 Grundy 3
87 Jack 3
88 May 3
89 Minnie 3
90 Friedrich Bhaer 3
91 Project Gutenberg-tm 3
92 Elizabeth 2
93 Apollyon 2
94 ma 2
95 Das 2
96 Don Pedro's 2
97 Theodore 2
98 Dora 2
99 Annie Moffat 2
100 Jo _ 2
101 Josy-phine 2
102 bob 2
103 Jenny Snow 2
104 JO 2
105 Lincoln 2
106 Nan 2
107 Dickens 2
108 Samuel Pickwick 2
109 Tracy Tupman 2
110 Nathaniel Winkle 2
111 Winkle 2
112 Sam Weller 2
113 Malaprop 2
114 Christopher Columbus 2
115 Lying 2
116 bandboxes 2
117 aloud 2
118 Down 2
119 bush 2
120 bein 2
121 Lottchen 2
122 Kitty Bryant's 2
123 Frenchwoman 2
124 Mis 2
125 Raphael 2
126 Eliott 2
127 Cornelius 2
128 Demijohn 2
129 Poor Jo 2
130 Kate Kearney 2
131 Flo 2
132 Emil 2
133 Scotts 2
134 Brookes 2
135 Baptiste 2
136 Meek 2
137 Mozart 2
138 Dodo 2
139 Aunt Dodo 2
140 Rob 2
141 Ted 2
142 Charles Dickens 2
143 Macbeth 1
144 Pilgrim 1
145 thee 1
146 Airy 1
147 Act 1
148 Santa Claus 1
149 Christopher 1
150 Columbus 1
151 spandy nice 1
152 cette jeune demoiselle 1
153 les pantoufles jolis 1
154 BURDENS 1
155 Kings 1
156 Florence 1
157 Maria\n 1
158 Parks 1
159 dahlia 1
160 Ellen 1
161 Susie Perkins 1
162 horrid!--and 1
163 Laugh 1
164 miles 1
165 mother,--one 1
166 Uncle Tom 1
167 Ivanhoe 1
168 Laurie forgot 1
169 Slough 1
170 Laurie rich 1
171 JAMES LAURENCE 1
172 James Laurence' 1
173 lingy 1
174 Katy Brown 1
175 Mary Kingsley 1
176 Miss Snow 1
177 Blimber 1
178 APOLLYON 1
179 queen 1
180 Bremer 1
181 Beth _ 1
182 Held Amy 1
183 Shivering 1
184 M. 1
185 Miss Clara 1
186 Cinderella 1
187 brooch 1
188 Miss Belle 1
189 s hang 1
190 Fisher 1
191 chickweed 1
192 diversions,--some 1
193 Gondola 1
194 gondola 1
195 Knights 1
196 the Lady Viola 1
197 Tis 1
198 Unmask 1
199 Ferdinand Devereux 1
200 Ferdinand 1
201 PICKWICK 1
202 Snowball Pat 1
203 S. B. PAT PAW 1
204 Snowball 1
205 Lecture 1
206 Hannah Brown 1
207 BETH BOUNCER 1
208 Constantine the Avenger 1
209 HINTS 1
210 S. P. 1
211 T. T. 1
212 N. W.\n 1
213 Welleresque 1
214 mails,--also 1
215 martin-house 1
216 Weller 1
217 The P. O. 1
218 Sairy Gamp 1
219 Katy 1
220 Flora McFlimsey 1
221 Boaz 1
222 Language 1
223 Laurie wrote,--\n\n 1
224 Kate Vaughn 1
225 Sunshine 1
226 Barker 1
227 Leghorn Laurie 1
228 Longmeadow 1
229 acorns 1
230 Thankee 1
231 Bosen 1
232 mermaid 1
233 Fred, Sallie 1
234 John Bull 1
235 Miss Kate 1
236 Mary Stuart 1
237 woe 1
238 Englishwoman 1
239 flung 1
240 blunt Jo 1
241 Bent 1
242 SECRETS 1
243 Scrabble 1
244 stairs 1
245 rang 1
246 Angelo 1
247 doin 1
248 Burney 1
249 before,--that 1
250 barber 1
251 Thomas 1
252 chestnut lock 1
253 Breakfast 1
254 Greatheart 1
255 Coffee 1
256 Meggy 1
257 Kiss 1
258 Merci 1
259 Papa 1
260 accordin 1
261 wearin 1
262 Hannah Mullet 1
263 Rappahannock 1
264 Quartermaster Mullett keeps 1
265 MADAM,--\n\n 1
266 Glad 1
267 Scarlet 1
268 sore throat 1
269 Call Meg 1
270 Hannah _ 1
271 Amy _ 1
272 ady 1
273 sech 1
274 Beth day 1
275 baker 1
276 Divine 1
277 Weary Hannah 1
278 Hark 1
279 Mop 1
280 Estelle 1
281 Madame 1
282 Pro-cras-ti 1
283 Protestant 1
284 Allyluyer 1
285 Amy Curtis 1
286 Theodore Laurence 1
287 Noter Dame 1
288 Kitty Bryant 1
289 Anni Domino 1
290 CONFIDENTIAL 1
291 Meg marry him 1
292 Caroline Percy 1
293 John _ 1
294 Grandfather 1
295 Sam 1
296 Queen Bess 1
297 Purrer 1
298 Madonna 1
299 Child 1
300 mum 1
301 bilin 1
302 Brooke,--at 1
303 now,--for 1
304 my John_ 1
305 Annie Moffat's 1
306 Sha'n't 1
307 Cook 1
308 James Laurence 1
309 Book 1
310 Sister Jo 1
311 GOSSIP 1
312 talked slang 1
313 frank confession 1
314 decorums 1
315 le brown house 1
316 pell 1
317 Psyche Laurie 1
318 merry words 1
319 knives 1
320 Toodles 1
321 Henshaw 1
322 Gummidge 1
323 Mark 1
324 Laurie frown 1
325 Uncle Carrol 1
326 Bacchus 1
327 Romeo 1
328 Madonnas 1
329 Michael Angelo 1
330 Maria Theresa 1
331 Nil 1
332 S. L. A. N. G. Northbury 1
333 Belzoni 1
334 Aim 1
335 Allen 1
336 robin 1
337 Keats 1
338 Martha 1
339 mutton 1
340 Receipt\nBook 1
341 Jack Scott 1
342 bemoan 1
343 Niobe 1
344 Mantalini 1
345 Ned Moffat's 1
346 John dryly 1
347 Presently Jo 1
348 Uncle Teddy 1
349 John Laurence 1
350 Megs 1
351 Shylock 1
352 Calm 1
353 Good_-by 1
354 Tom Brown 1
355 Tommy Chamberlain 1
356 Tommy 1
357 Framed 1
358 Killarney 1
359 Shun 1
360 Briton 1
361 Ward 1
362 Robert Lennox's 1
363 Kenilworth 1
364 Aunt Mary 1
365 Aye 1
366 -skelter 1
367 Punch 1
368 Route de Roi_ 1
369 Noah 1
370 Fechter 1
371 Frank Vaughn 1
372 Beth Frank 1
373 Camp Laurence 1
374 fellows,--especially Fred 1
375 Rainy 1
376 Marie Antoinette's 1
377 Bois 1
378 Père la\n Chaise 1
379 knew,--except Laurie 1
380 Goethe 1
381 Dannecker 1
382 Ariadne 1
383 Blöndchen_ 1
384 he-- 1
385 extravagances,--sending 1
386 Mentor 1
387 Cock 1
388 bonnie Dundee 1
389 Mabel 1
390 Prut 1
391 Minnie\n Kirke 1
392 Lager Beer 1
393 Ursa Major 1
394 Bon 1
395 Dis 1
396 I. 1
397 to,--he 1
398 L. 1
399 Milton 1
400 Nick Bottom 1
401 Jacks 1
402 Northbury 1
403 Goot 1
404 Wallenstein 1
405 merry bass-viol 1
406 Sherwood 1
407 Hannah More 1
408 Phillips 1
409 Hail 1
410 lie orange-orchards 1
411 Victor Emmanuel 1
412 Que pensez 1
413 Avigdor 1
414 Genius 1
415 Tower 1
416 Corsica 1
417 Junoesque 1
418 Diana 1
419 Misses Davis 1
420 Cardiglia 1
421 Serene Something 1
422 Rothschild 1
423 Jew 1
424 Joneses 1
425 Vladimir 1
426 Balzac 1
427 Femme 1
428 Day 1
429 make,--forgotten 1
430 John likes,--talk 1
431 Mamma 1
432 Will Demi 1
433 Mornin 1
434 John the Just 1
435 Sallie Moffatt 1
436 Raphaella 1
437 Saint Laurence 1
438 Jouvin 1
439 Rarey 1
440 Paradise 1
441 Au 1
442 Aunty Beth 1
443 Earthly 1
444 Laurie good 1
445 Mendelssohn 1
446 Beethoven 1
447 Bach 1
448 kinder 1
449 gloomy St. Gingolf 1
450 Mont St. Bernard 1
451 du Midi 1
452 Lausanne 1
453 Rousseau 1
454 Clarens 1
455 XLII 1
456 Grief 1
457 Providence 1
458 Johnson 1
459 headstrong 1
460 Beth lay ill 1
461 Mercy 1
462 Marsch,--but 1
463 Monsieur de Trop 1
464 Know'st 1
465 Récamier 1
466 Aristotle 1
467 Happy Amy 1
468 Alcibiades 1
469 Daisy make patty-cakes 1
470 Aunt Amy 1
471 Aunt Beth 1
472 kindred 1
473 grandson 1
474 bübchen 1
475 thou beginnest 1
476 Fessor 1
477 Marsch 1
478 mein Gott 1
479 Catherine 1
480 lads,--a 1
481 Dicks 1
482 Bhaer-garten 1
483 Mother Bhaer 1
484 God 1
485 grandma 1
486 sixtieth 1
487 Grandma 1
488 Unlucky Jo' 1
489 Tommy Bangs 1
490 Reginald B. Birch 1
491 Alice Barber Stephens 1
492 Jessie Willcox Smith 1
493 Harriet Roosevelt Richards 1
494 Cupid 1
495 WOMEN 1
496 Rose 1
497 Bob 1
498 Betty 1
499 JACK 1
500 JILL 1
501 Ednah D. Cheney 1
502 SHAWL-STRAPS 1
503 Sol Eytinge 1
504 MEPHISTOPHELES 1
505 Marjorie 1
506 Baa 1
507 QUEEN ASTOR 1
508 Elizabeth L. Gould 1
509 Louisa 1
510 Amy_ 1
511 transcribe red- 1
512 ferrule 1
513 Webster 1
514 Charles\nDickens 1
515 Augustus Snodgrass 1
516 Samuel Weller 1
517 N. Winkle's 1
518 Betsey 1
519 Peggotty 1
520 David Copperfield 1
521 know'st thou 1
522 transcribe Dove-cote 1
523 Flora 1
524 Louisa M. Alcott 1
525 eBooks 1
526 FULL LICENSE 1
527 Michael\n 1
528 S.\n 1
529 Gregory B. Newby 1
530 Michael S. Hart 1
531 Web 1

Get Places

| Type Label | Description |
|---|---|
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |

To extract and count places, we can follow the same model as above, except we will change our if statement to check for "ent" labels that match "GPE" or "LOC." These are the type labels for "countries, cities, states" and "non-GPE locations, mountain ranges, bodies of water."

places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
place count
0 Marmee 23
1 Washington 13
2 Paris 10
3 America 8
4 Rome 8
5 London 7
6 the United States 7
7 Switzerland 6
8 china 6
9 Hannah 6
10 Germany 6
11 Hagar 5
12 Plumfield 5
13 Project Gutenberg 5
14 Italy 3
15 Bethy 3
16 Moffats 3
17 France 3
18 turkey 3
19 New York 3
20 Berlin 3
21 Tina 3
22 U.S. 3
23 Undine 2
24 Sintram 2
25 Valley 2
26 Europe 2
27 Egypt 2
28 the\nriver 2
29 Vaughns 2
30 Celestial City 2
31 Jupiter 2
32 India 2
33 A.M. 2
34 Berne 2
35 Baden-Baden 2
36 Nice 2
37 Garland 2
38 Illustrated 2
39 Gutenberg 2
40 China 1
41 Banquo 1
42 South 1
43 the City of Destruction 1
44 the Slough of Despond to-night 1
45 Asia 1
46 Africa 1
47 chintz 1
48 maroon 1
49 Vevay 1
50 Heidelberg 1
51 Belsham 1
52 Wakefield 1
53 Latin, Algebra 1
54 tarlatan 1
55 Chiny 1
56 mignonette 1
57 larkspur 1
58 VENICE 1
59 Winkle 1
60 Tupman 1
61 Bacon 1
62 Milton 1
63 Longmeadow 1
64 rations,--I'll 1
65 the\nMountains 1
66 Canada 1
67 blue river 1
68 ruddy 1
69 Atalanta 1
70 the House of March 1
71 earth 1
72 Chick 1
73 Pewmonia 1
74 Heinrich 1
75 Minna 1
76 the bay-window 1
77 new kingdom 1
78 him,--and 1
79 the\nvalley 1
80 Hercules 1
81 Sphinx 1
82 Lisbon 1
83 Spiritualism 1
84 deplore 1
85 Forum 1
86 LONDON 1
87 Halifax 1
88 us,--Mr. 1
89 the Lakes of Killarney 1
90 aye 1
91 mum 1
92 Devonshire 1
93 Wellington 1
94 PARIS 1
95 Hogarth 1
96 Richmond Park 1
97 Saint Denis 1
98 the Tuileries Gardens 1
99 Luxembourg Gardens 1
100 Rhine 1
101 Bonn 1
102 saw,--the river 1
103 Nassau 1
104 Byronic 1
105 sofa,--long 1
106 NEW YORK 1
107 Märchen 1
108 Nile 1
109 the frothy sea 1
110 Kant 1
111 Hegel 1
112 Promenade 1
113 Villa Franca 1
114 Schubert 1
115 Greece 1
116 Tarlatan 1
117 Continent 1
118 Pole 1
119 india 1
120 Babyland 1
121 happiest kingdom 1
122 Monaco 1
123 Baptiste 1
124 Paradise 1
125 Vienna 1
126 Genoa 1
127 valley 1
128 Chillon 1
129 Earth 1
130 Dorcas 1
131 St. Martin 1
132 Hoffmann 1
133 Swartz 1
134 Hamburg 1
135 Professorin 1
136 Nearest 1
137 Gott 1
138 West 1
139 Tusser 1
140 Cowley 1
141 Pomonas 1
142 WASHINGTON STREET 1
143 BOSTON 1
144 BLOOM 1
145 New\n 1
146 Dark 1
147 New Hampshire 1
148 Washington St. 1
149 Boston 1
150 United States 1
151 GUTENBERG 1
152 Project 1
153 DIRECT 1
154 INDIRECT 1
155 Mississippi 1
156 Salt Lake City 1
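Notice that the tally treats "China" and "china" as distinct entities, and that "the\nriver" carries a line break inside the entity text. If you want to merge such near-duplicates, you can normalize the strings before counting. Here's a minimal sketch with a toy list of places (these stand in for the full extraction):

```python
from collections import Counter

# Toy stand-in for the extracted place strings
places = ["China", "china", "Paris", "Paris", "the\nriver"]

# Lowercase and collapse internal whitespace to merge near-duplicates
normalized = [" ".join(place.split()).lower() for place in places]
places_tally = Counter(normalized)
```

Whether to normalize is a judgment call: lowercasing merges "china" (porcelain, likely mis-tagged) with "China" (the country), which may or may not be what you want for your analysis.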

Get Streets & Parks

| Type Label | Description |
|---|---|
| FAC | Buildings, airports, highways, bridges, etc. |

To extract and count streets, parks, and other built places, we can follow the same model as above, except we will change our if statement to check for "ent" labels that match "FAC." This is the type label for "buildings, airports, highways, bridges, etc."

streets = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "FAC":
            streets.append(named_entity.text)

streets_tally = Counter(streets)

df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df
street count
0 Pickwick Hall 2
1 the "Mouse 1
2 Pickwick 1
3 the Tower of Babel 1
4 THE PUBLIC BEREAVEMENT 1
5 the Barnville Theatre 1
6 the moon 1
7 the 'Rambler 1
8 Regent Street 1
9 Hyde Park 1
10 else,--for 1
11 the Rue de Rivoli 1
12 the Madame de Staëls 1
13 the Jardin Publique 1
14 Castle Hill 1
15 Chauvain 1
16 Paglioni 1
17 Saxon 1
18 the Royal Theatre 1
19 Saint Stefan's 1
20 the muddy street 1
21 The Aunt-Hill 1
22 A Village Story 1
23 the Aunt-Hill 1
24 Camp and Fireside Stories 1
25 the Pickwick Portfolio 1

Get Works of Art

Type Label

Description

WORK_OF_ART

Titles of books, songs, etc.

To extract and count works of art, we can again follow the same model as above, this time changing our if statement to check for "ent" labels that match "WORK_OF_ART."

works_of_art = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "WORK_OF_ART":
            works_of_art.append(named_entity.text)

art_tally = Counter(works_of_art)

df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df
work_of_art count
0 Aunt March 5
1 Poor Jo 3
2 Merry Christmas 3
3 Teddy 3
4 Little Men 3
5 Daisy 2
6 The Pickwick Portfolio 2
7 Illustration: Tail-piece]\n\n\n\n\n 2
8 Illustration: 2
9 Illustration: Tail-piece 2
10 Come, Jo 2
11 Why, Jo 2
12 Bless 2
13 Love 2
14 the List of Illustrations 2
15 Project Gutenberg 2
16 Plain Vanilla ASCII 2
17 Little Tranquillity 1
18 The Witch's Curse 1
19 replied Mrs. March 1
20 Crinkle, crinkle, 'ittle 'tar 1
21 Die Engel-kinder 1
22 Hither I come,\n From my airy home 1
23 The Laurence 1
24 Dear me, I didn't know any one was here 1
25 Quel nom 1
26 Sixteen 1
27 Illustration: They sat down on the stairs 1
28 Buzz 1
29 Essays by the hour 1
30 Petrea's 1
31 Little Raphael 1
32 Theodore\n 1
33 The Seven Castles of the Diamond Lake 1
34 Nor I,--" 1
35 world,--marry 1
36 Aunt\nCockle-top 1
37 WOMAN AND HER POSITION 1
38 THE GREEK SLAVE 1
39 Poor old Jo 1
40 The Wide, Wide World 1
41 Croaker 1
42 DEAR JO 1
43 LAURIE 1
44 Rigmarole 1
45 Cutlasses 1
46 Grandfather and Napoleon 1
47 Delectable Mountain 1
48 Teddy's wrongs 1
49 Hurrah for Miss March 1
50 Spread Eagles 1
51 The Rival Painters 1
52 What _will_ 1
53 Evelina 1
54 Can't wait, and I'm afraid I haven't much faith in ink and dirt, though\n 1
55 O Jo 1
56 Handsome faces,--eyes 1
57 TOPSY-TURVY JO 1
58 A SONG FROM THE SUDS 1
59 Queen of my tub, I merrily sing,\n 1
60 MA CHERE MAMMA,--\n\n 1
61 "Yours Respectful 1
62 HANNAH MULLET 1
63 JAMES LAURENCE 1
64 Ask Hannah 1
65 Madame 1
66 _Witnesses_: 1
67 THEODORE LAURENCE 1
68 About Meg 1
69 Stop, Jo 1
70 O Meg 1
71 Johnson 1
72 Rasselas 1
73 "Hang the 'Rambler 1
74 THE JUNGFRAU TO BETH 1
75 A portrait of Joanna 1
76 Illustration: The Jungfrau 1
77 Jungfrau 1
78 Fulness 1
79 Illustration: Popping in her head now and then]\n\nLike 1
80 _I_ 1
81 Illustration: Shall I tell you how?]\n\n 1
82 LITTLE WOMEN 1
83 Illustration: Home of the Little Women]\n\n\n\n\n 1
84 The Spread Eagle 1
85 Yankee 1
86 Illustration: The First Wedding]\n\n XXV 1
87 Jupiter Ammon 1
88 By Jove 1
89 Run, Beth 1
90 Pharaohs 1
91 A Phantom\nHand 1
92 Curse of the Coventrys 1
93 Jo March 1
94 Yes, Amy was in despair that day 1
95 Speak for yourself, if you please. 1
96 Teddy's Own 1
97 Aunt Carrol 1
98 Illustration: Tail-piece]\n\n\n\n\n 1
99 Aunt and Flo were poorly all the way, and liked to be let\n alone, so when I had done what I could for them, I went and\n enjoyed myself. 1
100 A pause,--then Flo cried out, 'Bless 1
101 The Flirtations of Capt 1
102 'Now then, mum 1
103 Rotten Row 1
104 MIDNIGHT 1
105 Aunt 1
106 The Palais Royale 1
107 Having a quiet hour 1
108 part,--for 1
109 Ever your AMY 1
110 Olympia's Oath 1
111 Mercy on me, Beth loves Laurie 1
112 'Out upon you, fie upon you,\n Bold-faced jig 1
113 Yes, Jo 1
114 DEAR MARMEE AND BETH,--\n\n 1
115 'Now 1
116 'Me wants my Bhaer,' 1
117 'Now me mus tuddy my lessin 1
118 'Now Professor 1
119 'Governess 1
120 'Friend of the old lady's 1
121 'Handsome head, but no style 1
122 'Not a bit of it. 1
123 MY PRECIOUS 1
124 A Happy New Year 1
125 Bible 1
126 Bear or Beer 1
127 Sartor Resartus 1
128 Mees Marsch 1
129 Weekly Volcano 1
130 Sonata Pathétique 1
131 A Christmas party at our hotel. 1
132 Illustration: Mornin' now]\n\n 1
133 'Lazy Laurence 1
134 She _was_ kind 1
135 'Lazy Laurence' 1
136 "Yours gratefully, TELEMACHUS 1
137 Illustration: The Valley of the Shadow 1
138 Opera 1
139 the Alps of Savoy 1
140 Héloise 1
141 Dear Jo 1
142 Not to-night 1
143 Please, Madam Mother 1
144 Beth 1
145 The Three Little Kittens 1
146 IN THE GARRET.\n\n 1
147 'Meg' on the first lid, smooth and fair 1
148 'Jo' on the next lid 1
149 My Beth 1
150 Leaving Mrs. March 1
151 Eight Cousins 1
152 Meg 1
153 How They Camped\n Out 1
154 Roses and Forget-me-nots" 1
155 How They Ran Away 1
156 Pansies 1
157 "Water-Lilies 1
158 Shadow-Children 1
159 The Moss People 1
160 Little Women 1
161 The Pickwick Papers 1
162 Betty 1
163 Head Nurse of Ward 1
164 Bold-faced\njig 1
165 Tarantella 1
166 THE LITTLE WOMEN\nSERIES 1
167 The Works of Louisa 1
168 Project\nGutenberg 1
169 the\nFoundation 1
170 Project\n 1
171 Right\nof Replacement or Refund 1
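As the tally shows, WORK_OF_ART tagging is noisy: dialogue fragments like "Come, Jo" land next to real titles like *Little Women*. One quick cleanup, once the tally is in a dataframe, is to filter by count. Here's a minimal sketch with a toy tally (the titles and counts are illustrative stand-ins):

```python
from collections import Counter
import pandas as pd

# Toy tally standing in for art_tally
toy_tally = Counter({"Little Women": 3, "The Pickwick Portfolio": 2, "Come, Jo": 1})
df = pd.DataFrame(toy_tally.most_common(), columns=['work_of_art', 'count'])

# Keep only entities that appear more than once
frequent = df[df['count'] > 1]
```

Filtering by frequency won't catch every mis-tagged fragment, but it trims the long tail of one-off noise.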

Get NER in Context

from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=False):

    # If no labels are specified, check against all of spaCy's NER labels
    if not desired_ner_labels:
        desired_ner_labels = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own so its entities are re-tagged
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword appears in the entity (ignoring capitalization)
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces, then bold the entity text,
                # escaping it so regex metacharacters are treated literally
                sentence_text = re.sub(r'\n', ' ', sentence.text)
                sentence_text = re.sub(re.escape(named_entity.text), f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
for document in chunked_documents:
    get_ner_in_context('Jupiter', document)

LOC

By Jupiter


WORK_OF_ART

A crash, a cry, and a laugh from Laurie, accompanied by the indecorous exclamation, "Jupiter Ammon!


LOC

"Twins, by Jupiter!
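One subtlety in the highlighting step of `get_ner_in_context`: the entity text is used as a regular expression pattern, so an entity containing regex metacharacters (e.g. "N. Winkle's" from the people tally above, which contains a period) should be escaped with `re.escape` so it matches literally. Here's a minimal sketch of just this step, with a made-up sentence:

```python
import re

sentence = "Mr. N. Winkle's hat blew off."
entity = "N. Winkle's"

# re.escape makes the '.' in the entity match only a literal period
pattern = re.escape(entity)
highlighted = re.sub(pattern, f"**{entity}**", sentence, flags=re.IGNORECASE)
```

Without escaping, the bare '.' would match any character, which can bold more text than intended.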

Your Turn!

Now it’s your turn to take a crack at NER with a whole new text!

| Type Label | Description |
|---|---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type. |

In this section, you’re going to extract and count named entities from Barack Obama’s memoir The Audacity of Hope. We’re exploring Obama’s memoir because it’s chock full of named entities.

Open and read the text file

filepath = "../texts/literature/Obama-The-Audacity-of-Hope.txt"
text = open(filepath, encoding='utf-8').read()

To process The Audacity of Hope in smaller chunks (if working in Binder or on a computer with memory constraints):

chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))

To process The Audacity of Hope all at once (if working on a computer with a larger amount of memory):

document = nlp(text)

1. Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you’ve chosen) in The Audacity of Hope. If you need help, study the examples above.

#Your Code Here 👇 

2. What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?

Your answer here. (Double click this cell to type your answer.)

3. What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?

Your answer here. (Double click this cell to type your answer.)

4. What’s an insight that you might be able to glean about The Audacity of Hope based on your NER extraction?

Your answer here. (Double click this cell to type your answer.)