Named Entity Recognition¶
In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.
Dataset¶
Ada Lovelace’s Obituary & Louisa May Alcott’s Little Women¶
A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843.
—Claire Cain Miller, “Ada Lovelace,” New York Times Overlooked Obituaries
Here’s a preview of spaCy’s NER tagging of Ada Lovelace’s obituary:
Why is NER Useful?¶
Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we’ll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we’ll also do in a later lesson!).
Natural Language Processing (NLP)¶
Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). What is NLP, exactly? NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. At its heart, NLP seeks to understand and model human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google Translate, chatbots, and Siri? These are all examples of NLP in action!
Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts are now getting eerily good. (If you don’t believe me, check out this app that will autocomplete your sentences with GPT-2, a state-of-the-art text generation model. When I ran it, the model generated a mini-lecture from a “university professor” that sounds spookily close to home…)

Open-source NLP tools are getting very good, too. We’re going to use one of these open-source tools, the Python library spaCy, for our Named Entity Recognition tasks in this lesson.
How spaCy Works¶
The screenshot above shows spaCy correctly identifying named entities in Ada Lovelace’s New York Times obituary (something that we’ll test out for ourselves below). How does spaCy know that “Ada Lovelace” is a person and that “1843” is a date?
Well, spaCy doesn’t know, not for sure anyway. Instead, spaCy is making a very educated guess. This “guess” is based on what spaCy has learned about the English language after seeing lots of other examples.
That’s a colloquial way of saying: spaCy relies on machine learning models that were trained on a large amount of carefully labeled texts. (These texts were, in fact, often labeled and corrected by hand.) This is similar to our topic modeling work from the previous lesson, except that topic modeling is unsupervised; it doesn’t rely on labeled data.
The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. (Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.)
When spaCy identifies people and places in Ada Lovelace’s obituary, in other words, the NLP model is actually making predictions about the text based on what it has learned about how people and places function in English-language sentences.
NER with spaCy¶
Install spaCy¶
!pip install -U spacy
Import Libraries¶
We’re going to import spacy and displacy, a special spaCy module for visualization.
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400
We’re also going to import the Counter module for counting people, places, and things, and the pandas library for organizing and displaying data (we’re also changing the pandas default maximum row and column width display settings).
Download Language Model¶
Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:
!python -m spacy download en_core_web_sm
Note: spaCy offers models for other languages including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese don’t currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages through external dependencies, such as KoNLPy for Korean or Jieba for Chinese.
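These other models work the same way as the English one. For example, here’s a minimal sketch of downloading and loading spaCy’s small German model, de_core_news_sm (substitute whichever language model you need):

!python -m spacy download de_core_news_sm

import de_core_news_sm
nlp_de = de_core_news_sm.load()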
Load Language Model¶
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.
1. We can import the model as a module and then load it from the module.
import en_core_web_sm
nlp = en_core_web_sm.load()
2. We can load the model by name.
#nlp = spacy.load('en_core_web_sm')
If you just downloaded the model for the first time, it’s advisable to use Option 1, which lets you use the model immediately. Otherwise, you’ll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel… in the JupyterLab menu) before spacy.load() can find the newly installed model.
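Once the model is loaded, you can optionally peek at its processing pipeline to confirm that everything is in place. The list should include a ner component, which is the part of the pipeline that performs Named Entity Recognition:

nlp.pipe_names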
Process Document¶
We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in this one line of code.

After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read Ada Lovelace’s obituary. Then we run nlp() on the text and create our document.
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
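Before we zero in on named entities, here’s a quick optional sketch of some of the other language data packed into the processed document, such as part-of-speech tags and sentence boundaries:

# Part-of-speech tags for the first ten tokens
for token in document[:10]:
    print(token.text, token.pos_)

# The first two sentences that spaCy detected
for sentence in list(document.sents)[:2]:
    print(sentence.text)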
spaCy Named Entities¶
Below is a Named Entities chart taken from spaCy’s website, which shows the different named entities that spaCy can identify as well as their corresponding type labels.
Type Label | Description |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | "first", "second", etc. |
CARDINAL | Numerals that do not fall under another type. |
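If you ever forget what one of these type labels means, you can ask spaCy directly with spacy.explain():

# Returns the label's plain-English description,
# e.g. "Nationalities or religious or political groups" for NORP
spacy.explain("NORP")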
To quickly see spaCy’s NER in action, we can use the spaCy module displacy with the style= parameter set to “ent” (short for entities):
displacy.render(document, style="ent")
From a quick glance at the text above, we can see that spaCy is doing quite well with NER. But it’s definitely not perfect.
Though spaCy correctly identifies “Charles Babbage” and “Lord Byron” as PERSON entities, it labels “Ada Lovelace” in the opening sentence as an ORG and later tags “the Countess of Lovelace” as a WORK_OF_ART. And though spaCy correctly identifies “London” as a GPE (a place) a few paragraphs down, it labels “Jacquard” as a NORP (a nationality or group), when really the Jacquard is a type of loom, named after Joseph Marie Jacquard.
This inconsistency is very important to note and keep in mind. If we wanted to use spaCy’s NER for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn’t be perfect. That’s why understanding the limitations of this tool is so crucial. While spaCy’s NER can be very good for identifying entities in broad strokes, it can’t be relied upon for anything exact and fine-grained — not out of the box anyway.
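If you want to review (and hand-check) the highlighted entities outside of Jupyter, displacy can return the rendering as raw HTML instead of displaying it inline. Here’s a minimal sketch that saves the result to a file (the filename is just an example):

# Render to an HTML string rather than displaying in the notebook
html = displacy.render(document, style="ent", jupyter=False, page=True)

# Write the highlighted obituary to disk so it can be opened in a web browser
with open("ada-lovelace-ner.html", mode="w", encoding="utf-8") as outfile:
    outfile.write(html)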
Get Named Entities¶
All the named entities in our document can be found in the document.ents property. If we check out document.ents, we can see all the entities from Ada Lovelace’s obituary.
document.ents
(first,
CLAIRE,
A century,
Ada Lovelace,
the modern-day,
1843,
Jacquard,
British,
Charles Babbage,
Analytical Engine,
Lovelace,
1852,
36,
first,
the Analytical Engine,
seventh,
Bernoulli,
Bernoulli,
Swiss,
Jacob Bernoulli,
Walter Isaacson,
“The Innovators,
Any piece of content,
The Analytical Engine,
British,
Lord Byron,
Romantic,
Betty Alexandra Toole,
Lovelace,
the mid-20th century,
the Defense Department,
October,
Lady Lovelace,
The London Examiner,
Sciences,
Augusta Ada Byron,
Dec. 10, 1815,
London,
Byron,
Annabella Milbanke,
8,
Lord Byron,
Medea,
Smith Collection/Gado/,
British,
the day,
Mary Somerville,
Somerville,
Lovelace,
Babbage,
17,
two-foot,
almost two decades,
William King,
Somerville,
1835,
19,
the Countess of Lovelace,
1839,
two,
Somerville,
Mathematics,
every day,
Trigonometry,
Cubic and Biquadratic Equations,
1840,
Lovelace,
Augustus De Morgan,
London,
first,
1843,
27,
Lovelace,
Babbage Analytical Engine,
nearly three,
Notes,
first,
Ursula Martin,
the University of Oxford,
Lovelace,
less than a decade later,
Nov. 27, 1852,
Notes,
Claire Cain Miller,
The Upshot,
first,
Ada Lovelace)
Each of the named entities in document.ents contains more information about itself, which we can access by iterating through document.ents with a simple for loop.

For each named_entity in document.ents, we will extract the named_entity and its corresponding named_entity.label_.
for named_entity in document.ents:
    print(named_entity, named_entity.label_)
first ORDINAL
CLAIRE ORG
A century DATE
Ada Lovelace ORG
the modern-day DATE
1843 DATE
Jacquard NORP
British NORP
Charles Babbage PERSON
Analytical Engine ORG
Lovelace ORG
1852 DATE
36 CARDINAL
first ORDINAL
the Analytical Engine ORG
seventh ORDINAL
Bernoulli ORG
Bernoulli ORG
Swiss NORP
Jacob Bernoulli PERSON
Walter Isaacson PERSON
“The Innovators WORK_OF_ART
Any piece of content WORK_OF_ART
The Analytical Engine WORK_OF_ART
British NORP
Lord Byron PERSON
Romantic ORG
Betty Alexandra Toole PERSON
Lovelace PERSON
the mid-20th century DATE
the Defense Department ORG
October DATE
Lady Lovelace PERSON
The London Examiner ORG
Sciences ORG
Augusta Ada Byron PERSON
Dec. 10, 1815 DATE
London GPE
Byron PERSON
Annabella Milbanke PERSON
8 DATE
Lord Byron PERSON
Medea PERSON
Smith Collection/Gado/ ORG
British NORP
the day DATE
Mary Somerville PERSON
Somerville PERSON
Lovelace PERSON
Babbage PERSON
17 DATE
two-foot QUANTITY
almost two decades DATE
William King PERSON
Somerville PERSON
1835 DATE
19 DATE
the Countess of Lovelace WORK_OF_ART
1839 DATE
two CARDINAL
Somerville PERSON
Mathematics PERSON
every day DATE
Trigonometry GPE
Cubic and Biquadratic Equations ORG
1840 DATE
Lovelace ORG
Augustus De Morgan PERSON
London GPE
first ORDINAL
1843 DATE
27 CARDINAL
Lovelace PERSON
Babbage Analytical Engine ORG
nearly three CARDINAL
Notes WORK_OF_ART
first ORDINAL
Ursula Martin PERSON
the University of Oxford ORG
Lovelace ORG
less than a decade later DATE
Nov. 27, 1852 DATE
Notes PRODUCT
Claire Cain Miller PERSON
The Upshot WORK_OF_ART
first ORDINAL
Ada Lovelace ORG
To extract just the named entities that have been identified as PERSON, we can add a simple if statement into the mix:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
Charles Babbage
Jacob Bernoulli
Walter Isaacson
Lord Byron
Betty Alexandra Toole
Lovelace
Lady Lovelace
Augusta Ada Byron
Byron
Annabella Milbanke
Lord Byron
Medea
Mary Somerville
Somerville
Lovelace
Babbage
William King
Somerville
Somerville
Mathematics
Augustus De Morgan
Lovelace
Ursula Martin
Claire Cain Miller
NER with Long Texts or Many Texts¶
For the rest of this lesson, we’re going to work with Louisa May Alcott’s novel Little Women. Because the novel is long, we’ll split it into smaller chunks and process the chunks with nlp.pipe(), which handles a batch of texts more efficiently (and with less memory strain) than calling nlp() on the entire book at once.
filepath = "../texts/literature/Little-Women_Louisa-May-Alcott.txt"
text = open(filepath).read()
import math
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)
text_chunks = []
for number in range(0, len(text), chunk_size):
text_chunk = text[number:number+chunk_size]
text_chunks.append(text_chunk)
chunked_documents = list(nlp.pipe(text_chunks))
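Why chunk the novel at all? By default, spaCy refuses to process a single text longer than nlp.max_length characters (1,000,000 by default), because very long texts can consume a lot of memory. Here’s a quick sketch of how you might check whether a text needs chunking:

# Compare the length of the text with spaCy's default limit
print(f"Text length: {len(text):,} characters")
print(f"spaCy's limit: {nlp.max_length:,} characters")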
Get People¶
Type Label | Description |
---|---|
PERSON | People, including fictional. |
To extract and count the people, we will use an if statement that will pull out words only if their “ent” label matches “PERSON.”
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)
people_tally = Counter(people)
df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
character | count | |
---|---|---|
0 | Jo | 1354 |
1 | Laurie | 581 |
2 | Amy | 580 |
3 | Meg | 545 |
4 | Beth | 460 |
5 | John | 145 |
6 | Hannah | 112 |
7 | Brooke | 100 |
8 | Laurence | 98 |
9 | Bhaer | 83 |
10 | Teddy | 61 |
11 | Demi | 47 |
12 | Fred | 44 |
13 | Sallie | 41 |
14 | Kate | 28 |
15 | Daisy | 25 |
16 | Moffat | 21 |
17 | Margaret | 20 |
18 | Ned | 19 |
19 | Dashwood | 18 |
20 | Davis | 17 |
21 | Belle | 16 |
22 | Frank | 16 |
23 | Scott | 14 |
24 | Tina | 13 |
25 | Kirke | 12 |
26 | Zara | 11 |
27 | Fritz | 10 |
28 | Crocker | 9 |
29 | John Brooke | 9 |
30 | Tudor | 9 |
31 | papa | 8 |
32 | Grace | 8 |
33 | March | 8 |
34 | Chester | 8 |
35 | Roderigo | 7 |
36 | Aunt | 7 |
37 | Carrol | 7 |
38 | Alcott | 7 |
39 | Don Pedro | 6 |
40 | Shakespeare | 6 |
41 | Hagar | 6 |
42 | Pip | 6 |
43 | Hummel | 6 |
44 | Bangs | 6 |
45 | Esther | 6 |
46 | Lamb | 6 |
47 | Aunt Carrol | 6 |
48 | K. | 6 |
49 | Gardiner | 5 |
50 | Hush | 5 |
51 | Joanna | 5 |
52 | Jenny | 5 |
53 | Pickwick | 5 |
54 | Snodgrass | 5 |
55 | Lotty | 5 |
56 | Thou | 5 |
57 | Kitty | 5 |
58 | Belsham | 4 |
59 | King | 4 |
60 | Mademoiselle | 4 |
61 | Ned Moffat | 4 |
62 | Bethy | 4 |
63 | Mary | 4 |
64 | Ellen Tree | 4 |
65 | _ | 4 |
66 | Franz | 4 |
67 | Norton | 4 |
68 | Friedrich | 4 |
69 | Josephine | 3 |
70 | Gott | 3 |
71 | Susie | 3 |
72 | Cutter | 3 |
73 | Snow | 3 |
74 | George | 3 |
75 | Clara | 3 |
76 | Hortense | 3 |
77 | Brown | 3 |
78 | Sallie Gardiner | 3 |
79 | Fred Vaughn | 3 |
80 | David | 3 |
81 | Jimmy | 3 |
82 | Minna | 3 |
83 | Sallie Moffat | 3 |
84 | Randal | 3 |
85 | Parker | 3 |
86 | Grundy | 3 |
87 | Jack | 3 |
88 | May | 3 |
89 | Minnie | 3 |
90 | Friedrich Bhaer | 3 |
91 | Project Gutenberg-tm | 3 |
92 | Elizabeth | 2 |
93 | Apollyon | 2 |
94 | ma | 2 |
95 | Das | 2 |
96 | Don Pedro's | 2 |
97 | Theodore | 2 |
98 | Dora | 2 |
99 | Annie Moffat | 2 |
100 | Jo _ | 2 |
101 | Josy-phine | 2 |
102 | bob | 2 |
103 | Jenny Snow | 2 |
104 | JO | 2 |
105 | Lincoln | 2 |
106 | Nan | 2 |
107 | Dickens | 2 |
108 | Samuel Pickwick | 2 |
109 | Tracy Tupman | 2 |
110 | Nathaniel Winkle | 2 |
111 | Winkle | 2 |
112 | Sam Weller | 2 |
113 | Malaprop | 2 |
114 | Christopher Columbus | 2 |
115 | Lying | 2 |
116 | bandboxes | 2 |
117 | aloud | 2 |
118 | Down | 2 |
119 | bush | 2 |
120 | bein | 2 |
121 | Lottchen | 2 |
122 | Kitty Bryant's | 2 |
123 | Frenchwoman | 2 |
124 | Mis | 2 |
125 | Raphael | 2 |
126 | Eliott | 2 |
127 | Cornelius | 2 |
128 | Demijohn | 2 |
129 | Poor Jo | 2 |
130 | Kate Kearney | 2 |
131 | Flo | 2 |
132 | Emil | 2 |
133 | Scotts | 2 |
134 | Brookes | 2 |
135 | Baptiste | 2 |
136 | Meek | 2 |
137 | Mozart | 2 |
138 | Dodo | 2 |
139 | Aunt Dodo | 2 |
140 | Rob | 2 |
141 | Ted | 2 |
142 | Charles Dickens | 2 |
143 | Macbeth | 1 |
144 | Pilgrim | 1 |
145 | thee | 1 |
146 | Airy | 1 |
147 | Act | 1 |
148 | Santa Claus | 1 |
149 | Christopher | 1 |
150 | Columbus | 1 |
151 | spandy nice | 1 |
152 | cette jeune demoiselle | 1 |
153 | les pantoufles jolis | 1 |
154 | BURDENS | 1 |
155 | Kings | 1 |
156 | Florence | 1 |
157 | Maria\n | 1 |
158 | Parks | 1 |
159 | dahlia | 1 |
160 | Ellen | 1 |
161 | Susie Perkins | 1 |
162 | horrid!--and | 1 |
163 | Laugh | 1 |
164 | miles | 1 |
165 | mother,--one | 1 |
166 | Uncle Tom | 1 |
167 | Ivanhoe | 1 |
168 | Laurie forgot | 1 |
169 | Slough | 1 |
170 | Laurie rich | 1 |
171 | JAMES LAURENCE | 1 |
172 | James Laurence' | 1 |
173 | lingy | 1 |
174 | Katy Brown | 1 |
175 | Mary Kingsley | 1 |
176 | Miss Snow | 1 |
177 | Blimber | 1 |
178 | APOLLYON | 1 |
179 | queen | 1 |
180 | Bremer | 1 |
181 | Beth _ | 1 |
182 | Held Amy | 1 |
183 | Shivering | 1 |
184 | M. | 1 |
185 | Miss Clara | 1 |
186 | Cinderella | 1 |
187 | brooch | 1 |
188 | Miss Belle | 1 |
189 | s hang | 1 |
190 | Fisher | 1 |
191 | chickweed | 1 |
192 | diversions,--some | 1 |
193 | Gondola | 1 |
194 | gondola | 1 |
195 | Knights | 1 |
196 | the Lady Viola | 1 |
197 | Tis | 1 |
198 | Unmask | 1 |
199 | Ferdinand Devereux | 1 |
200 | Ferdinand | 1 |
201 | PICKWICK | 1 |
202 | Snowball Pat | 1 |
203 | S. B. PAT PAW | 1 |
204 | Snowball | 1 |
205 | Lecture | 1 |
206 | Hannah Brown | 1 |
207 | BETH BOUNCER | 1 |
208 | Constantine the Avenger | 1 |
209 | HINTS | 1 |
210 | S. P. | 1 |
211 | T. T. | 1 |
212 | N. W.\n | 1 |
213 | Welleresque | 1 |
214 | mails,--also | 1 |
215 | martin-house | 1 |
216 | Weller | 1 |
217 | The P. O. | 1 |
218 | Sairy Gamp | 1 |
219 | Katy | 1 |
220 | Flora McFlimsey | 1 |
221 | Boaz | 1 |
222 | Language | 1 |
223 | Laurie wrote,--\n\n | 1 |
224 | Kate Vaughn | 1 |
225 | Sunshine | 1 |
226 | Barker | 1 |
227 | Leghorn Laurie | 1 |
228 | Longmeadow | 1 |
229 | acorns | 1 |
230 | Thankee | 1 |
231 | Bosen | 1 |
232 | mermaid | 1 |
233 | Fred, Sallie | 1 |
234 | John Bull | 1 |
235 | Miss Kate | 1 |
236 | Mary Stuart | 1 |
237 | woe | 1 |
238 | Englishwoman | 1 |
239 | flung | 1 |
240 | blunt Jo | 1 |
241 | Bent | 1 |
242 | SECRETS | 1 |
243 | Scrabble | 1 |
244 | stairs | 1 |
245 | rang | 1 |
246 | Angelo | 1 |
247 | doin | 1 |
248 | Burney | 1 |
249 | before,--that | 1 |
250 | barber | 1 |
251 | Thomas | 1 |
252 | chestnut lock | 1 |
253 | Breakfast | 1 |
254 | Greatheart | 1 |
255 | Coffee | 1 |
256 | Meggy | 1 |
257 | Kiss | 1 |
258 | Merci | 1 |
259 | Papa | 1 |
260 | accordin | 1 |
261 | wearin | 1 |
262 | Hannah Mullet | 1 |
263 | Rappahannock | 1 |
264 | Quartermaster Mullett keeps | 1 |
265 | MADAM,--\n\n | 1 |
266 | Glad | 1 |
267 | Scarlet | 1 |
268 | sore throat | 1 |
269 | Call Meg | 1 |
270 | Hannah _ | 1 |
271 | Amy _ | 1 |
272 | ady | 1 |
273 | sech | 1 |
274 | Beth day | 1 |
275 | baker | 1 |
276 | Divine | 1 |
277 | Weary Hannah | 1 |
278 | Hark | 1 |
279 | Mop | 1 |
280 | Estelle | 1 |
281 | Madame | 1 |
282 | Pro-cras-ti | 1 |
283 | Protestant | 1 |
284 | Allyluyer | 1 |
285 | Amy Curtis | 1 |
286 | Theodore Laurence | 1 |
287 | Noter Dame | 1 |
288 | Kitty Bryant | 1 |
289 | Anni Domino | 1 |
290 | CONFIDENTIAL | 1 |
291 | Meg marry him | 1 |
292 | Caroline Percy | 1 |
293 | John _ | 1 |
294 | Grandfather | 1 |
295 | Sam | 1 |
296 | Queen Bess | 1 |
297 | Purrer | 1 |
298 | Madonna | 1 |
299 | Child | 1 |
300 | mum | 1 |
301 | bilin | 1 |
302 | Brooke,--at | 1 |
303 | now,--for | 1 |
304 | my John_ | 1 |
305 | Annie Moffat's | 1 |
306 | Sha'n't | 1 |
307 | Cook | 1 |
308 | James Laurence | 1 |
309 | Book | 1 |
310 | Sister Jo | 1 |
311 | GOSSIP | 1 |
312 | talked slang | 1 |
313 | frank confession | 1 |
314 | decorums | 1 |
315 | le brown house | 1 |
316 | pell | 1 |
317 | Psyche Laurie | 1 |
318 | merry words | 1 |
319 | knives | 1 |
320 | Toodles | 1 |
321 | Henshaw | 1 |
322 | Gummidge | 1 |
323 | Mark | 1 |
324 | Laurie frown | 1 |
325 | Uncle Carrol | 1 |
326 | Bacchus | 1 |
327 | Romeo | 1 |
328 | Madonnas | 1 |
329 | Michael Angelo | 1 |
330 | Maria Theresa | 1 |
331 | Nil | 1 |
332 | S. L. A. N. G. Northbury | 1 |
333 | Belzoni | 1 |
334 | Aim | 1 |
335 | Allen | 1 |
336 | robin | 1 |
337 | Keats | 1 |
338 | Martha | 1 |
339 | mutton | 1 |
340 | Receipt\nBook | 1 |
341 | Jack Scott | 1 |
342 | bemoan | 1 |
343 | Niobe | 1 |
344 | Mantalini | 1 |
345 | Ned Moffat's | 1 |
346 | John dryly | 1 |
347 | Presently Jo | 1 |
348 | Uncle Teddy | 1 |
349 | John Laurence | 1 |
350 | Megs | 1 |
351 | Shylock | 1 |
352 | Calm | 1 |
353 | Good_-by | 1 |
354 | Tom Brown | 1 |
355 | Tommy Chamberlain | 1 |
356 | Tommy | 1 |
357 | Framed | 1 |
358 | Killarney | 1 |
359 | Shun | 1 |
360 | Briton | 1 |
361 | Ward | 1 |
362 | Robert Lennox's | 1 |
363 | Kenilworth | 1 |
364 | Aunt Mary | 1 |
365 | Aye | 1 |
366 | -skelter | 1 |
367 | Punch | 1 |
368 | Route de Roi_ | 1 |
369 | Noah | 1 |
370 | Fechter | 1 |
371 | Frank Vaughn | 1 |
372 | Beth Frank | 1 |
373 | Camp Laurence | 1 |
374 | fellows,--especially Fred | 1 |
375 | Rainy | 1 |
376 | Marie Antoinette's | 1 |
377 | Bois | 1 |
378 | Père la\n Chaise | 1 |
379 | knew,--except Laurie | 1 |
380 | Goethe | 1 |
381 | Dannecker | 1 |
382 | Ariadne | 1 |
383 | Blöndchen_ | 1 |
384 | he-- | 1 |
385 | extravagances,--sending | 1 |
386 | Mentor | 1 |
387 | Cock | 1 |
388 | bonnie Dundee | 1 |
389 | Mabel | 1 |
390 | Prut | 1 |
391 | Minnie\n Kirke | 1 |
392 | Lager Beer | 1 |
393 | Ursa Major | 1 |
394 | Bon | 1 |
395 | Dis | 1 |
396 | I. | 1 |
397 | to,--he | 1 |
398 | L. | 1 |
399 | Milton | 1 |
400 | Nick Bottom | 1 |
401 | Jacks | 1 |
402 | Northbury | 1 |
403 | Goot | 1 |
404 | Wallenstein | 1 |
405 | merry bass-viol | 1 |
406 | Sherwood | 1 |
407 | Hannah More | 1 |
408 | Phillips | 1 |
409 | Hail | 1 |
410 | lie orange-orchards | 1 |
411 | Victor Emmanuel | 1 |
412 | Que pensez | 1 |
413 | Avigdor | 1 |
414 | Genius | 1 |
415 | Tower | 1 |
416 | Corsica | 1 |
417 | Junoesque | 1 |
418 | Diana | 1 |
419 | Misses Davis | 1 |
420 | Cardiglia | 1 |
421 | Serene Something | 1 |
422 | Rothschild | 1 |
423 | Jew | 1 |
424 | Joneses | 1 |
425 | Vladimir | 1 |
426 | Balzac | 1 |
427 | Femme | 1 |
428 | Day | 1 |
429 | make,--forgotten | 1 |
430 | John likes,--talk | 1 |
431 | Mamma | 1 |
432 | Will Demi | 1 |
433 | Mornin | 1 |
434 | John the Just | 1 |
435 | Sallie Moffatt | 1 |
436 | Raphaella | 1 |
437 | Saint Laurence | 1 |
438 | Jouvin | 1 |
439 | Rarey | 1 |
440 | Paradise | 1 |
441 | Au | 1 |
442 | Aunty Beth | 1 |
443 | Earthly | 1 |
444 | Laurie good | 1 |
445 | Mendelssohn | 1 |
446 | Beethoven | 1 |
447 | Bach | 1 |
448 | kinder | 1 |
449 | gloomy St. Gingolf | 1 |
450 | Mont St. Bernard | 1 |
451 | du Midi | 1 |
452 | Lausanne | 1 |
453 | Rousseau | 1 |
454 | Clarens | 1 |
455 | XLII | 1 |
456 | Grief | 1 |
457 | Providence | 1 |
458 | Johnson | 1 |
459 | headstrong | 1 |
460 | Beth lay ill | 1 |
461 | Mercy | 1 |
462 | Marsch,--but | 1 |
463 | Monsieur de Trop | 1 |
464 | Know'st | 1 |
465 | Récamier | 1 |
466 | Aristotle | 1 |
467 | Happy Amy | 1 |
468 | Alcibiades | 1 |
469 | Daisy make patty-cakes | 1 |
470 | Aunt Amy | 1 |
471 | Aunt Beth | 1 |
472 | kindred | 1 |
473 | grandson | 1 |
474 | bübchen | 1 |
475 | thou beginnest | 1 |
476 | Fessor | 1 |
477 | Marsch | 1 |
478 | mein Gott | 1 |
479 | Catherine | 1 |
480 | lads,--a | 1 |
481 | Dicks | 1 |
482 | Bhaer-garten | 1 |
483 | Mother Bhaer | 1 |
484 | God | 1 |
485 | grandma | 1 |
486 | sixtieth | 1 |
487 | Grandma | 1 |
488 | Unlucky Jo' | 1 |
489 | Tommy Bangs | 1 |
490 | Reginald B. Birch | 1 |
491 | Alice Barber Stephens | 1 |
492 | Jessie Willcox Smith | 1 |
493 | Harriet Roosevelt Richards | 1 |
494 | Cupid | 1 |
495 | WOMEN | 1 |
496 | Rose | 1 |
497 | Bob | 1 |
498 | Betty | 1 |
499 | JACK | 1 |
500 | JILL | 1 |
501 | Ednah D. Cheney | 1 |
502 | SHAWL-STRAPS | 1 |
503 | Sol Eytinge | 1 |
504 | MEPHISTOPHELES | 1 |
505 | Marjorie | 1 |
506 | Baa | 1 |
507 | QUEEN ASTOR | 1 |
508 | Elizabeth L. Gould | 1 |
509 | Louisa | 1 |
510 | Amy_ | 1 |
511 | transcribe red- | 1 |
512 | ferrule | 1 |
513 | Webster | 1 |
514 | Charles\nDickens | 1 |
515 | Augustus Snodgrass | 1 |
516 | Samuel Weller | 1 |
517 | N. Winkle's | 1 |
518 | Betsey | 1 |
519 | Peggotty | 1 |
520 | David Copperfield | 1 |
521 | know'st thou | 1 |
522 | transcribe Dove-cote | 1 |
523 | Flora | 1 |
524 | Louisa M. Alcott | 1 |
525 | eBooks | 1 |
526 | FULL LICENSE | 1 |
527 | Michael\n | 1 |
528 | S.\n | 1 |
529 | Gregory B. Newby | 1 |
530 | Michael S. Hart | 1 |
531 | Web | 1 |
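As the long tail of this tally shows, spaCy’s guesses come with plenty of noise (“Thou,” “Hush,” and “Lying” are not characters), so the results would need cleaning before any serious analysis. Here’s one possible sketch of a light cleaning pass; the cutoff of 10 appearances is an arbitrary choice for illustration:

# Merge counts for names that differ only in capitalization or stray whitespace,
# then keep only names that appear at least 10 times (an arbitrary threshold)
cleaned_tally = Counter()
for name, count in people_tally.items():
    cleaned_tally[name.strip().title()] += count

cleaned_df = pd.DataFrame(
    [(name, count) for name, count in cleaned_tally.most_common() if count >= 10],
    columns=['character', 'count'])
cleaned_df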
Get Places¶
Type Label | Description |
---|---|
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
To extract and count places, we can follow the same model as above, except we will change our if statement to check for “ent” labels that match “GPE” or “LOC.” These are the type labels for “countries, cities, states” and “non-GPE locations, mountain ranges, bodies of water.”
places = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
            places.append(named_entity.text)
places_tally = Counter(places)
df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
place | count | |
---|---|---|
0 | Marmee | 23 |
1 | Washington | 13 |
2 | Paris | 10 |
3 | America | 8 |
4 | Rome | 8 |
5 | London | 7 |
6 | the United States | 7 |
7 | Switzerland | 6 |
8 | china | 6 |
9 | Hannah | 6 |
10 | Germany | 6 |
11 | Hagar | 5 |
12 | Plumfield | 5 |
13 | Project Gutenberg | 5 |
14 | Italy | 3 |
15 | Bethy | 3 |
16 | Moffats | 3 |
17 | France | 3 |
18 | turkey | 3 |
19 | New York | 3 |
20 | Berlin | 3 |
21 | Tina | 3 |
22 | U.S. | 3 |
23 | Undine | 2 |
24 | Sintram | 2 |
25 | Valley | 2 |
26 | Europe | 2 |
27 | Egypt | 2 |
28 | the\nriver | 2 |
29 | Vaughns | 2 |
30 | Celestial City | 2 |
31 | Jupiter | 2 |
32 | India | 2 |
33 | A.M. | 2 |
34 | Berne | 2 |
35 | Baden-Baden | 2 |
36 | Nice | 2 |
37 | Garland | 2 |
38 | Illustrated | 2 |
39 | Gutenberg | 2 |
40 | China | 1 |
41 | Banquo | 1 |
42 | South | 1 |
43 | the City of Destruction | 1 |
44 | the Slough of Despond to-night | 1 |
45 | Asia | 1 |
46 | Africa | 1 |
47 | chintz | 1 |
48 | maroon | 1 |
49 | Vevay | 1 |
50 | Heidelberg | 1 |
51 | Belsham | 1 |
52 | Wakefield | 1 |
53 | Latin, Algebra | 1 |
54 | tarlatan | 1 |
55 | Chiny | 1 |
56 | mignonette | 1 |
57 | larkspur | 1 |
58 | VENICE | 1 |
59 | Winkle | 1 |
60 | Tupman | 1 |
61 | Bacon | 1 |
62 | Milton | 1 |
63 | Longmeadow | 1 |
64 | rations,--I'll | 1 |
65 | the\nMountains | 1 |
66 | Canada | 1 |
67 | blue river | 1 |
68 | ruddy | 1 |
69 | Atalanta | 1 |
70 | the House of March | 1 |
71 | earth | 1 |
72 | Chick | 1 |
73 | Pewmonia | 1 |
74 | Heinrich | 1 |
75 | Minna | 1 |
76 | the bay-window | 1 |
77 | new kingdom | 1 |
78 | him,--and | 1 |
79 | the\nvalley | 1 |
80 | Hercules | 1 |
81 | Sphinx | 1 |
82 | Lisbon | 1 |
83 | Spiritualism | 1 |
84 | deplore | 1 |
85 | Forum | 1 |
86 | LONDON | 1 |
87 | Halifax | 1 |
88 | us,--Mr. | 1 |
89 | the Lakes of Killarney | 1 |
90 | aye | 1 |
91 | mum | 1 |
92 | Devonshire | 1 |
93 | Wellington | 1 |
94 | PARIS | 1 |
95 | Hogarth | 1 |
96 | Richmond Park | 1 |
97 | Saint Denis | 1 |
98 | the Tuileries Gardens | 1 |
99 | Luxembourg Gardens | 1 |
100 | Rhine | 1 |
101 | Bonn | 1 |
102 | saw,--the river | 1 |
103 | Nassau | 1 |
104 | Byronic | 1 |
105 | sofa,--long | 1 |
106 | NEW YORK | 1 |
107 | Märchen | 1 |
108 | Nile | 1 |
109 | the frothy sea | 1 |
110 | Kant | 1 |
111 | Hegel | 1 |
112 | Promenade | 1 |
113 | Villa Franca | 1 |
114 | Schubert | 1 |
115 | Greece | 1 |
116 | Tarlatan | 1 |
117 | Continent | 1 |
118 | Pole | 1 |
119 | india | 1 |
120 | Babyland | 1 |
121 | happiest kingdom | 1 |
122 | Monaco | 1 |
123 | Baptiste | 1 |
124 | Paradise | 1 |
125 | Vienna | 1 |
126 | Genoa | 1 |
127 | valley | 1 |
128 | Chillon | 1 |
129 | Earth | 1 |
130 | Dorcas | 1 |
131 | St. Martin | 1 |
132 | Hoffmann | 1 |
133 | Swartz | 1 |
134 | Hamburg | 1 |
135 | Professorin | 1 |
136 | Nearest | 1 |
137 | Gott | 1 |
138 | West | 1 |
139 | Tusser | 1 |
140 | Cowley | 1 |
141 | Pomonas | 1 |
142 | WASHINGTON STREET | 1 |
143 | BOSTON | 1 |
144 | BLOOM | 1 |
145 | New\n | 1 |
146 | Dark | 1 |
147 | New Hampshire | 1 |
148 | Washington St. | 1 |
149 | Boston | 1 |
150 | United States | 1 |
151 | GUTENBERG | 1 |
152 | Project | 1 |
153 | DIRECT | 1 |
154 | INDIRECT | 1 |
155 | Mississippi | 1 |
156 | Salt Lake City | 1 |
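Since we’ll eventually want to map locations like these (as mentioned at the start of the lesson), it can be handy to save the tally for later reuse. Here’s a small sketch that writes the place counts to a CSV file (the filename is just an example):

# Save the place counts so they can be geocoded and mapped later
df.to_csv('little-women-places.csv', index=False, encoding='utf-8')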
Get Streets & Parks¶
Type Label | Description |
---|---|
FAC | Buildings, airports, highways, bridges, etc. |
To extract and count streets, parks, and other built places, we can follow the same model as above, except we will change our if statement to check for “ent” labels that match “FAC.” This is the type label for “buildings, airports, highways, bridges, etc.”
streets = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "FAC":
            streets.append(named_entity.text)
streets_tally = Counter(streets)
df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df
street | count | |
---|---|---|
0 | Pickwick Hall | 2 |
1 | the "Mouse | 1 |
2 | Pickwick | 1 |
3 | the Tower of Babel | 1 |
4 | THE PUBLIC BEREAVEMENT | 1 |
5 | the Barnville Theatre | 1 |
6 | the moon | 1 |
7 | the 'Rambler | 1 |
8 | Regent Street | 1 |
9 | Hyde Park | 1 |
10 | else,--for | 1 |
11 | the Rue de Rivoli | 1 |
12 | the Madame de Staëls | 1 |
13 | the Jardin Publique | 1 |
14 | Castle Hill | 1 |
15 | Chauvain | 1 |
16 | Paglioni | 1 |
17 | Saxon | 1 |
18 | the Royal Theatre | 1 |
19 | Saint Stefan's | 1 |
20 | the muddy street | 1 |
21 | The Aunt-Hill | 1 |
22 | A Village Story | 1 |
23 | the Aunt-Hill | 1 |
24 | Camp and Fireside Stories | 1 |
25 | the Pickwick Portfolio | 1 |
Get Works of Art¶
Type Label | Description |
---|---|
WORK_OF_ART | Titles of books, songs, etc. |
To extract and count works of art, we can follow the same model as above, this time changing our if statement to match the “ent” label “WORK_OF_ART.”
works_of_art = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "WORK_OF_ART":
            works_of_art.append(named_entity.text)
art_tally = Counter(works_of_art)
df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df
work_of_art | count | |
---|---|---|
0 | Aunt March | 5 |
1 | Poor Jo | 3 |
2 | Merry Christmas | 3 |
3 | Teddy | 3 |
4 | Little Men | 3 |
5 | Daisy | 2 |
6 | The Pickwick Portfolio | 2 |
7 | Illustration: Tail-piece]\n\n\n\n\n | 2 |
8 | Illustration: | 2 |
9 | Illustration: Tail-piece | 2 |
10 | Come, Jo | 2 |
11 | Why, Jo | 2 |
12 | Bless | 2 |
13 | Love | 2 |
14 | the List of Illustrations | 2 |
15 | Project Gutenberg | 2 |
16 | Plain Vanilla ASCII | 2 |
17 | Little Tranquillity | 1 |
18 | The Witch's Curse | 1 |
19 | replied Mrs. March | 1 |
20 | Crinkle, crinkle, 'ittle 'tar | 1 |
21 | Die Engel-kinder | 1 |
22 | Hither I come,\n From my airy home | 1 |
23 | The Laurence | 1 |
24 | Dear me, I didn't know any one was here | 1 |
25 | Quel nom | 1 |
26 | Sixteen | 1 |
27 | Illustration: They sat down on the stairs | 1 |
28 | Buzz | 1 |
29 | Essays by the hour | 1 |
30 | Petrea's | 1 |
31 | Little Raphael | 1 |
32 | Theodore\n | 1 |
33 | The Seven Castles of the Diamond Lake | 1 |
34 | Nor I,--" | 1 |
35 | world,--marry | 1 |
36 | Aunt\nCockle-top | 1 |
37 | WOMAN AND HER POSITION | 1 |
38 | THE GREEK SLAVE | 1 |
39 | Poor old Jo | 1 |
40 | The Wide, Wide World | 1 |
41 | Croaker | 1 |
42 | DEAR JO | 1 |
43 | LAURIE | 1 |
44 | Rigmarole | 1 |
45 | Cutlasses | 1 |
46 | Grandfather and Napoleon | 1 |
47 | Delectable Mountain | 1 |
48 | Teddy's wrongs | 1 |
49 | Hurrah for Miss March | 1 |
50 | Spread Eagles | 1 |
51 | The Rival Painters | 1 |
52 | What _will_ | 1 |
53 | Evelina | 1 |
54 | Can't wait, and I'm afraid I haven't much faith in ink and dirt, though\n | 1 |
55 | O Jo | 1 |
56 | Handsome faces,--eyes | 1 |
57 | TOPSY-TURVY JO | 1 |
58 | A SONG FROM THE SUDS | 1 |
59 | Queen of my tub, I merrily sing,\n | 1 |
60 | MA CHERE MAMMA,--\n\n | 1 |
61 | "Yours Respectful | 1 |
62 | HANNAH MULLET | 1 |
63 | JAMES LAURENCE | 1 |
64 | Ask Hannah | 1 |
65 | Madame | 1 |
66 | _Witnesses_: | 1 |
67 | THEODORE LAURENCE | 1 |
68 | About Meg | 1 |
69 | Stop, Jo | 1 |
70 | O Meg | 1 |
71 | Johnson | 1 |
72 | Rasselas | 1 |
73 | "Hang the 'Rambler | 1 |
74 | THE JUNGFRAU TO BETH | 1 |
75 | A portrait of Joanna | 1 |
76 | Illustration: The Jungfrau | 1 |
77 | Jungfrau | 1 |
78 | Fulness | 1 |
79 | Illustration: Popping in her head now and then]\n\nLike | 1 |
80 | _I_ | 1 |
81 | Illustration: Shall I tell you how?]\n\n | 1 |
82 | LITTLE WOMEN | 1 |
83 | Illustration: Home of the Little Women]\n\n\n\n\n | 1 |
84 | The Spread Eagle | 1 |
85 | Yankee | 1 |
86 | Illustration: The First Wedding]\n\n XXV | 1 |
87 | Jupiter Ammon | 1 |
88 | By Jove | 1 |
89 | Run, Beth | 1 |
90 | Pharaohs | 1 |
91 | A Phantom\nHand | 1 |
92 | Curse of the Coventrys | 1 |
93 | Jo March | 1 |
94 | Yes, Amy was in despair that day | 1 |
95 | Speak for yourself, if you please. | 1 |
96 | Teddy's Own | 1 |
97 | Aunt Carrol | 1 |
98 | Illustration: Tail-piece]\n\n\n\n\n | 1 |
99 | Aunt and Flo were poorly all the way, and liked to be let\n alone, so when I had done what I could for them, I went and\n enjoyed myself. | 1 |
100 | A pause,--then Flo cried out, 'Bless | 1 |
101 | The Flirtations of Capt | 1 |
102 | 'Now then, mum | 1 |
103 | Rotten Row | 1 |
104 | MIDNIGHT | 1 |
105 | Aunt | 1 |
106 | The Palais Royale | 1 |
107 | Having a quiet hour | 1 |
108 | part,--for | 1 |
109 | Ever your AMY | 1 |
110 | Olympia's Oath | 1 |
111 | Mercy on me, Beth loves Laurie | 1 |
112 | 'Out upon you, fie upon you,\n Bold-faced jig | 1 |
113 | Yes, Jo | 1 |
114 | DEAR MARMEE AND BETH,--\n\n | 1 |
115 | 'Now | 1 |
116 | 'Me wants my Bhaer,' | 1 |
117 | 'Now me mus tuddy my lessin | 1 |
118 | 'Now Professor | 1 |
119 | 'Governess | 1 |
120 | 'Friend of the old lady's | 1 |
121 | 'Handsome head, but no style | 1 |
122 | 'Not a bit of it. | 1 |
123 | MY PRECIOUS | 1 |
124 | A Happy New Year | 1 |
125 | Bible | 1 |
126 | Bear or Beer | 1 |
127 | Sartor Resartus | 1 |
128 | Mees Marsch | 1 |
129 | Weekly Volcano | 1 |
130 | Sonata Pathétique | 1 |
131 | A Christmas party at our hotel. | 1 |
132 | Illustration: Mornin' now]\n\n | 1 |
133 | 'Lazy Laurence | 1 |
134 | She _was_ kind | 1 |
135 | 'Lazy Laurence' | 1 |
136 | "Yours gratefully, TELEMACHUS | 1 |
137 | Illustration: The Valley of the Shadow | 1 |
138 | Opera | 1 |
139 | the Alps of Savoy | 1 |
140 | Héloise | 1 |
141 | Dear Jo | 1 |
142 | Not to-night | 1 |
143 | Please, Madam Mother | 1 |
144 | Beth | 1 |
145 | The Three Little Kittens | 1 |
146 | IN THE GARRET.\n\n | 1 |
147 | 'Meg' on the first lid, smooth and fair | 1 |
148 | 'Jo' on the next lid | 1 |
149 | My Beth | 1 |
150 | Leaving Mrs. March | 1 |
151 | Eight Cousins | 1 |
152 | Meg | 1 |
153 | How They Camped\n Out | 1 |
154 | Roses and Forget-me-nots" | 1 |
155 | How They Ran Away | 1 |
156 | Pansies | 1 |
157 | "Water-Lilies | 1 |
158 | Shadow-Children | 1 |
159 | The Moss People | 1 |
160 | Little Women | 1 |
161 | The Pickwick Papers | 1 |
162 | Betty | 1 |
163 | Head Nurse of Ward | 1 |
164 | Bold-faced\njig | 1 |
165 | Tarantella | 1 |
166 | THE LITTLE WOMEN\nSERIES | 1 |
167 | The Works of Louisa | 1 |
168 | Project\nGutenberg | 1 |
169 | the\nFoundation | 1 |
170 | Project\n | 1 |
171 | Right\nof Replacement or Refund | 1 |
Get NER in Context¶
Finally, we can write a function that finds every sentence in which a keyword appears as part of a named entity and displays that sentence with the keyword bolded and its entity label shown above it.

from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=False):
    # If no specific labels were requested, search across all of spaCy's entity types
    if not desired_ner_labels:
        desired_ner_labels = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own so we get entities for just that sentence
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword appears in the entity (ignoring capitalization)
            # and whether the entity's label is one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces and bold the matched entity,
                # escaping it so any regex special characters are treated literally
                sentence_text = re.sub('\n', ' ', sentence.text)
                sentence_text = re.sub(re.escape(named_entity.text), f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)
                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
for document in chunked_documents:
    get_ner_in_context('Jupiter', document)
LOC
By Jupiter
WORK_OF_ART
A crash, a cry, and a laugh from Laurie, accompanied by the indecorous exclamation, "Jupiter Ammon!
LOC
"Twins, by Jupiter!
Your Turn!¶
Now it’s your turn to take a crack at NER with a whole new text!
Type Label | Description |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | "first", "second", etc. |
CARDINAL | Numerals that do not fall under another type. |
In this section, you’re going to extract and count named entities from The Autobiography of Benjamin Franklin.
Open and read the text file
filepath = "../texts/literature/The-Autobiography-of-Benjamin-Franklin.txt"
text = open(filepath, encoding='utf-8').read()
To process the book in smaller chunks (if working in Binder or on a computer with memory constraints):
chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))
To process the book all at once (if working on a computer with a larger amount of memory):
document = nlp(text)
1. Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you’ve chosen) in the book. If you need help, study the examples above.
#Your Code Here 👇
2. What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?
Your answer here. (Double click this cell to type your answer.)
3. What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?
Your answer here. (Double click this cell to type your answer.)
4. What’s an insight that you might be able to glean about the book based on your NER extraction?
Your answer here. (Double click this cell to type your answer.)