Named Entity Recognition#
In this lesson, we’re going to learn about a text analysis method called Named Entity Recognition (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.
We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for NER with other languages.
Please reach out if you’re interested in adding another language!
Dataset#
Ada Lovelace’s Obituary & Louisa May Alcott’s Little Women#
A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843.
-Claire Cain Miller, “Ada Lovelace,” New York Times Overlooked Obituaries
Here’s a preview of spaCy’s NER tagging the opening of Ada Lovelace’s obituary (we’ll generate the full tagged text ourselves later in this lesson):
A century DATE before the dawn of the computer age, Ada Lovelace PERSON imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard PERSON loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage’s PERSON Analytical Engine PERSON , was never built. But her writings about computing have earned Lovelace PERSON — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.
Why is NER Useful?#
Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we’ll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we’ll also do in a later lesson!).
Natural Language Processing (NLP)#
Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is to understand human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google translate, chat bots, or Siri? These are all examples of NLP in action!
Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts and images are now getting eerily good.
Open-source NLP tools are getting very good, too. We’re going to use one of these open-source tools, the Python library spaCy
, for our Named Entity Recognition tasks in this lesson.
How spaCy Works#
The tagged preview above shows spaCy correctly identifying named entities in Ada Lovelace’s New York Times obituary (something that we’ll test out for ourselves below). How does spaCy know that “Ada Lovelace” is a person and that “1843” is a date?
Well, spaCy doesn’t know, not for sure anyway. Instead, spaCy is making a very educated guess. This “guess” is based on what spaCy has learned about the English language after seeing lots of other examples.
That’s a colloquial way of saying: spaCy relies on machine learning models that were trained on a large collection of carefully labeled texts. These texts were, in fact, often labeled and corrected by hand. This is similar to our topic modeling work from the previous lesson, except our topic model wasn’t using labeled data.
The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.
When spaCy identifies people and places in Ada Lovelace’s obituary, in other words, the NLP model is actually making predictions about the text based on what it has learned about how people and places function in English-language sentences.
NER with spaCy#
Install spaCy#
!pip install -U spacy
Import Libraries#
We’re going to import spacy
and displacy
, a special spaCy module for visualization.
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400
We’re also going to import the Counter
module for counting people, places, and things, and the pandas
library for organizing and displaying data (we’re also changing the pandas default max row and column width display setting).
Download Language Model#
Next we need to download the English-language model (en_core_web_sm
), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm
model by running the cell below:
!python -m spacy download en_core_web_sm
Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.
spaCy also offers language and tokenization support for additional languages via external dependencies, such as KoNLPy for Korean.
Load Language Model#
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.
1. We can import the model as a module and then load it from the module.
import en_core_web_sm
nlp = en_core_web_sm.load()
2. We can load the model by name.
#nlp = spacy.load('en_core_web_sm')
If you just downloaded the model for the first time, it’s advisable to use Option 1. Then you can use the model immediately. Otherwise, you’ll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel… in the Jupyter Lab menu).
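As an optional sanity check (the exact list varies by spaCy version), you can peek at the components in the loaded pipeline and confirm that “ner,” the named entity recognizer, is among them:

# Optional: list the pipeline components; 'ner' is the named entity recognizer
print(nlp.pipe_names)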
Process Document#
We first need to process our document
with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.
After processing, the document
object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.
In the cell below, we open and read Ada Lovelace’s obituary. Then we run nlp() on the text to create our document.
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
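As an optional aside (not required for the rest of the lesson), here’s a small sketch of some of the other annotations now stored on the document object, such as sentence boundaries and part-of-speech tags:

# Optional peek at other annotations on the processed document:
# the first sentence that spaCy detected...
first_sentence = list(document.sents)[0]
print(first_sentence.text)

# ...and the part-of-speech tag for each of the first eight tokens
for token in document[:8]:
    print(token.text, token.pos_)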
spaCy Named Entities#
Below is a Named Entities chart for English-language spaCy taken from its website. This chart shows the different named entities that spaCy can identify as well as their corresponding type labels.
| Type Label | Description |
|---|---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including “%”. |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |
To quickly see spaCy’s English-language NER in action, we can use the spaCy module displacy
with the style=
parameter set to “ent” (short for entities):
displacy.render(document, style="ent")
A century DATE before the dawn of the computer age, Ada Lovelace PERSON imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843 DATE . It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard PERSON loom weaves flowers and leaves.” The computer she was writing about, the British NORP inventor Charles Babbage’s PERSON Analytical Engine PERSON , was never built. But her writings about computing have earned Lovelace PERSON — who died of uterine cancer in 1852 DATE at 36 CARDINAL — recognition as the first ORDINAL computer programmer.
The program she wrote for the Analytical Engine was to calculate the seventh ORDINAL Bernoulli PERSON number. ( Bernoulli PERSON numbers, named after the Swiss NORP mathematician Jacob Bernoulli PERSON , are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculating numbers, she said, to understand symbols and be used to create music or art.
“This insight would become the core concept of the digital age,” Walter Isaacson PERSON wrote in his book “The Innovators WORK_OF_ART .” “Any piece of content, data or information — music, text, pictures, numbers, symbols, sounds, video — could be expressed in digital form and manipulated by machines.” She also explored the ramifications of what a computer could do, writing about the responsibility placed on the person programming the machine, and raising and then dismissing the notion that computers could someday think and create on their own — what we now call artificial intelligence.
“ The Analytical Engine WORK_OF_ART has no pretensions whatever to originate any thing,” she wrote. “It can do whatever we know how to order it to perform.”
Lovelace PERSON , a British NORP socialite who was the daughter of Lord Byron ORG , the Romantic poet, had a gift for combining art and science, one of her biographers, Betty Alexandra Toole PERSON , has written. She thought of math and logic as creative and imaginative, and called it “poetical science.”
Math PERSON “constitutes the language through which alone we can adequately express the great facts of the natural world,” Lovelace PERSON wrote.
Her work, which was rediscovered in the mid-20th century DATE , inspired the Defense Department ORG to name a programming language after her and each October DATE Ada Lovelace PERSON Day signifies a celebration of women in technology. Lovelace PERSON lived when women were not considered to be prominent scientific thinkers, and her skills were often described as masculine.
“With an understanding thoroughly masculine in solidity, grasp and firmness, Lady Lovelace PERSON had all the delicacies of the most refined female character,” said an obituary in The London Examiner FAC .
Babbage ORG , who called her the “enchantress of numbers,” once wrote that she “has thrown her magical spell around the most abstract of Sciences and has grasped it with a force which few masculine intellects (in our own country at least) could have exerted over it.”
Augusta Ada Byron ORG was born on Dec. 10, 1815 DATE , in London GPE , to Lord Byron ORG and Annabella Milbanke ORG . Her parents separated when she was an infant, and her father died when she was 8 DATE . Her mother — whom Lord Byron called the “princess of parallelograms” and, after their falling out, a “mathematical Medea” — was a social reformer from a wealthy family who had a deep interest in mathematics.
An etching from a portrait of Lovelace PERSON as a child. She is said to have had a gift for combining art and science. Smith Collection/Gado/Getty Images Lovelace ORG showed a passion for math and mechanics from a young age, encouraged by her mother. Because of her class, she had access to private tutors and to intellectuals in British NORP scientific and literary society. She was insatiably curious and surrounded herself with big thinkers of the day DATE , including Mary Somerville PERSON , a scientist and writer.
It was Somerville GPE who introduced Lovelace PERSON to Babbage when she was 17 DATE , at a salon he hosted soon after she made her society debut. He showed her a two-foot QUANTITY high, brass mechanical calculator he had built, and it gripped her imagination. They began a correspondence about math and science that lasted almost two decades DATE .
She also met her husband, William King PERSON , through Somerville GPE . They married in 1835 DATE , when she was 19 DATE . He soon became an earl, and she became the Countess of Lovelace PERSON . By 1839 DATE , she had given birth to two CARDINAL sons and a daughter.
She was determined, however, not to let her family life slow her work. The year she was married, she wrote to Somerville GPE : “I now read Mathematics NORP every day and am occupied in Trigonometry PRODUCT and in preliminaries to Cubic GPE and Biquadratic Equations. So you see that matrimony has by no means lessened my taste for these pursuits, nor my determination to carry them on.”
In 1840 DATE , Lovelace PERSON asked Augustus De Morgan PERSON , a math professor in London GPE , to tutor her. Through exchanging letters, he taught her university-level math. He later wrote to her mother that if a young male student had shown her skill, “they would have certainly made him an original mathematical investigator, perhaps of first ORDINAL -rate eminence.”
It was in 1843 DATE , when she was 27 DATE , that Lovelace PERSON wrote her most lasting contribution to computer science.
She published her translation of an academic paper about the Babbage Analytical Engine ORG and added a section, nearly three CARDINAL times the length of the paper, titled, “ Notes PRODUCT .” Here, she described how the computer would work, imagined its potential and wrote the first ORDINAL program.
Researchers have come to see it as “an extraordinary document,” said Ursula Martin PERSON , a computer scientist at the University of Oxford ORG who has studied Lovelace PERSON ’s life and work. “She’s talking about the abstract principles of computation, how you could program it, and big ideas like maybe it could compose music, maybe it could think.”
Lovelace PERSON died less than a decade later DATE , on Nov. 27, 1852 DATE . In the “ Notes PRODUCT ,” she imagined a future in which computers could do more powerful and faster analysis than humans.
“A new, a vast and a powerful language is developed for the future use of analysis,” she wrote, “in which to wield its truths so that these may become of more speedy and accurate practical application for the purposes of mankind.”
Claire Cain Miller PERSON writes about gender for The Upshot ORG . She first ORDINAL learned about Ada Lovelace PERSON while covering the tech industry, where women are severely underrepresented.
From a quick glance at the text above, we can see that the English-language spaCy model is doing quite well with NER. But it’s definitely not perfect.
Though spaCy correctly identifies “Ada Lovelace” as a PERSON and “London” as a GPE, it labels the poet “Lord Byron” as an ORG, tags “Math” as a PERSON, and marks “Somerville” (the scientist Mary Somerville) as a GPE. It also calls the “Analytical Engine” (a machine) a PERSON in one sentence and a WORK_OF_ART in another.
This inconsistency is very important to note and keep in mind. If we wanted to use spaCy’s English-language NER model for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn’t be perfect. That’s why understanding the limitations of this tool is so crucial. While spaCy’s English-language NER can be very good for identifying entities in broad strokes, it can’t be relied upon for anything exact and fine-grained — not out of the box anyway.
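One way to patch recurring mistakes like these is spaCy’s EntityRuler component, which lets you hard-code labels for specific strings so they take priority over the statistical model’s guesses. The lesson doesn’t cover it, so treat the following as a hedged sketch (assuming spaCy 3.x syntax) rather than a required step:

# A minimal sketch, assuming spaCy 3.x: load a fresh copy of the model and add an
# EntityRuler before the statistical NER so hand-written patterns win for known trouble spots
patched_nlp = spacy.load('en_core_web_sm')
ruler = patched_nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Lord Byron"},          # tagged ORG in the output above
    {"label": "PRODUCT", "pattern": "Analytical Engine"},  # tagged PERSON in the output above
])
patched_document = patched_nlp(text)

Even with a ruler, some manual checking is still wise; the patterns only cover the errors you already know about.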
Get Named Entities#
All the named entities in our document
can be found in the document.ents
property. If we check out document.ents
, we can see all the entities from Ada Lovelace’s obituary.
document.ents
(first,
CLAIRE CAIN MILLER,
A century,
Ada Lovelace,
1843,
Jacquard,
British,
Charles Babbage’s,
Analytical Engine,
Lovelace,
1852,
36,
first,
seventh,
Bernoulli,
Bernoulli,
Swiss,
Jacob Bernoulli,
Walter Isaacson,
“The Innovators,
The Analytical Engine,
Lovelace,
British,
Lord Byron,
Betty Alexandra Toole,
Math,
Lovelace,
the mid-20th century,
the Defense Department,
October,
Ada Lovelace,
Lovelace,
Lady Lovelace,
The London Examiner,
Babbage,
Augusta Ada Byron,
Dec. 10, 1815,
London,
Lord Byron,
Annabella Milbanke,
8,
Lovelace,
Smith Collection/Gado/Getty Images
Lovelace,
British,
the day,
Mary Somerville,
Somerville,
Lovelace,
17,
two-foot,
almost two decades,
William King,
Somerville,
1835,
19,
Lovelace,
1839,
two,
Somerville,
Mathematics,
Trigonometry,
Cubic,
1840,
Lovelace,
Augustus De Morgan,
London,
first,
1843,
27,
Lovelace,
the Babbage Analytical Engine,
nearly three,
Notes,
first,
Ursula Martin,
the University of Oxford,
Lovelace,
Lovelace,
less than a decade later,
Nov. 27, 1852,
Notes,
Claire Cain Miller,
The Upshot,
first,
Ada Lovelace)
Each of the named entities in document.ents
contains more information about itself, which we can access by iterating through the document.ents
with a simple for
loop.
For each named_entity
in document.ents
, we will extract the named_entity
and its corresponding named_entity.label_
.
for named_entity in document.ents:
print(named_entity, named_entity.label_)
first ORDINAL
CLAIRE CAIN MILLER PERSON
A century DATE
Ada Lovelace PERSON
1843 DATE
Jacquard PERSON
British NORP
Charles Babbage’s PERSON
Analytical Engine PERSON
Lovelace PERSON
1852 DATE
36 CARDINAL
first ORDINAL
seventh ORDINAL
Bernoulli PERSON
Bernoulli PERSON
Swiss NORP
Jacob Bernoulli PERSON
Walter Isaacson PERSON
“The Innovators WORK_OF_ART
The Analytical Engine WORK_OF_ART
Lovelace PERSON
British NORP
Lord Byron ORG
Betty Alexandra Toole PERSON
Math PERSON
Lovelace PERSON
the mid-20th century DATE
the Defense Department ORG
October DATE
Ada Lovelace PERSON
Lovelace PERSON
Lady Lovelace PERSON
The London Examiner FAC
Babbage ORG
Augusta Ada Byron ORG
Dec. 10, 1815 DATE
London GPE
Lord Byron ORG
Annabella Milbanke ORG
8 DATE
Lovelace PERSON
Smith Collection/Gado/Getty Images
Lovelace ORG
British NORP
the day DATE
Mary Somerville PERSON
Somerville GPE
Lovelace PERSON
17 DATE
two-foot QUANTITY
almost two decades DATE
William King PERSON
Somerville GPE
1835 DATE
19 DATE
Lovelace PERSON
1839 DATE
two CARDINAL
Somerville GPE
Mathematics NORP
Trigonometry PRODUCT
Cubic GPE
1840 DATE
Lovelace PERSON
Augustus De Morgan PERSON
London GPE
first ORDINAL
1843 DATE
27 DATE
Lovelace PERSON
the Babbage Analytical Engine ORG
nearly three CARDINAL
Notes PRODUCT
first ORDINAL
Ursula Martin PERSON
the University of Oxford ORG
Lovelace PERSON
Lovelace PERSON
less than a decade later DATE
Nov. 27, 1852 DATE
Notes PRODUCT
Claire Cain Miller PERSON
The Upshot ORG
first ORDINAL
Ada Lovelace PERSON
To extract just the named entities that have been identified as PERSON
, we can add a simple if
statement into the mix:
for named_entity in document.ents:
if named_entity.label_ == "PERSON":
print(named_entity)
CLAIRE CAIN MILLER
Ada Lovelace
Jacquard
Charles Babbage’s
Analytical Engine
Lovelace
Bernoulli
Bernoulli
Jacob Bernoulli
Walter Isaacson
Lovelace
Betty Alexandra Toole
Math
Lovelace
Ada Lovelace
Lovelace
Lady Lovelace
Lovelace
Mary Somerville
Lovelace
William King
Lovelace
Lovelace
Augustus De Morgan
Lovelace
Ursula Martin
Lovelace
Lovelace
Claire Cain Miller
Ada Lovelace
NER with Long Texts or Many Texts#
For the rest of this lesson, we’re going to work with Louisa May Alcott’s novel Little Women. Because a novel is much longer than a single obituary, we’ll break the text into smaller chunks and process them as a batch with nlp.pipe(), which is friendlier to your computer’s memory than processing the whole novel at once.
filepath = "../texts/literature/Little-Women_Louisa-May-Alcott.txt"
text = open(filepath).read()
import math
number_of_chunks = 80
chunk_size = math.ceil(len(text) / number_of_chunks)
text_chunks = []
for number in range(0, len(text), chunk_size):
text_chunk = text[number:number+chunk_size]
text_chunks.append(text_chunk)
chunked_documents = list(nlp.pipe(text_chunks))
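As a quick optional check, we can confirm how many chunks we created and roughly how long each one is:

# Optional check: number of chunks and their approximate size in characters
print(len(text_chunks), "chunks of about", chunk_size, "characters each")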
Get People#
| Type Label | Description |
|---|---|
| PERSON | People, including fictional. |
To extract and count the people, we will use an if
statement that will pull out words only if their “ent” label matches “PERSON.”
Pandas Review
Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!
people = []
for document in chunked_documents:
for named_entity in document.ents:
if named_entity.label_ == "PERSON":
people.append(named_entity.text)
people_tally = Counter(people)
df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
character | count | |
---|---|---|
0 | Jo | 1256 |
1 | Amy | 645 |
2 | Laurie | 570 |
3 | Beth | 465 |
4 | Meg | 311 |
5 | John | 144 |
6 | Hannah | 122 |
7 | Brooke | 96 |
8 | Laurence | 85 |
9 | Bhaer | 77 |
10 | Teddy | 51 |
11 | Fred | 40 |
12 | Project Gutenberg | 40 |
13 | Daisy | 29 |
14 | Kate | 27 |
15 | Demi | 24 |
16 | Moffat | 22 |
17 | Margaret | 21 |
18 | Davis | 18 |
19 | Ned | 18 |
20 | Flo | 18 |
21 | Aunt | 15 |
22 | Frank | 15 |
23 | Dashwood | 15 |
24 | Esther | 13 |
25 | Marmee | 12 |
26 | Sallie | 12 |
27 | Zara | 11 |
28 | Scott | 11 |
29 | Roderigo | 10 |
30 | Chester | 10 |
31 | Hagar | 9 |
32 | Annie | 9 |
33 | Jo\n | 9 |
34 | John Brooke | 9 |
35 | Kirke | 9 |
36 | Fritz | 9 |
37 | Tina | 8 |
38 | Bethy | 7 |
39 | Carrol | 7 |
40 | Don Pedro | 6 |
41 | Shakespeare | 6 |
42 | Gardiner | 6 |
43 | God | 6 |
44 | Joanna | 6 |
45 | Lotty | 6 |
46 | Bangs | 6 |
47 | K. | 6 |
48 | Alcott | 6 |
49 | Project\nGutenberg | 6 |
50 | Jo decidedly | 5 |
51 | Jenny | 5 |
52 | Pickwick | 5 |
53 | Miss Crocker | 5 |
54 | Hummel | 5 |
55 | March | 5 |
56 | Tudor | 5 |
57 | Kitty | 5 |
58 | Grandpa | 4 |
59 | Annie Moffat | 4 |
60 | Raphael | 4 |
61 | King | 4 |
62 | Clara | 4 |
63 | Miss Belle | 4 |
64 | Ned Moffat | 4 |
65 | Crocker | 4 |
66 | Grace | 4 |
67 | Mary | 4 |
68 | Ellen Tree | 4 |
69 | Down | 4 |
70 | JO | 4 |
71 | Norton | 4 |
72 | Gott | 4 |
73 | Friedrich | 4 |
74 | Gutenberg | 4 |
75 | Josephine | 3 |
76 | Belsham | 3 |
77 | Susie | 3 |
78 | Cutter | 3 |
79 | Snow | 3 |
80 | Annie Moffat's | 3 |
81 | George | 3 |
82 | Belle | 3 |
83 | Miss | 3 |
84 | Tupman | 3 |
85 | Snodgrass | 3 |
86 | Sallie Gardiner | 3 |
87 | Fred Vaughn | 3 |
88 | pell-mell | 3 |
89 | David | 3 |
90 | Jimmy | 3 |
91 | gravely,-- | 3 |
92 | Hush | 3 |
93 | Jo felt | 3 |
94 | Sallie Moffat | 3 |
95 | Randal | 3 |
96 | Grundy | 3 |
97 | Presently Jo | 3 |
98 | Jack | 3 |
99 | Lamb | 3 |
100 | Miss Lamb | 3 |
101 | Thou | 3 |
102 | Friedrich Bhaer | 3 |
103 | Project Gutenberg-tm | 3 |
104 | Elizabeth | 2 |
105 | Don Pedro's | 2 |
106 | Theodore | 2 |
107 | Dora | 2 |
108 | gratefully,-- | 2 |
109 | Jo\neagerly | 2 |
110 | JAMES LAURENCE | 2 |
111 | Jenny Snow | 2 |
112 | Edgeworth | 2 |
113 | Lincoln | 2 |
114 | Nan | 2 |
115 | Jo stoutly | 2 |
116 | Dickens | 2 |
117 | Samuel Pickwick | 2 |
118 | Tracy Tupman | 2 |
119 | Nathaniel Winkle | 2 |
120 | Antonio | 2 |
121 | Sam Weller | 2 |
122 | Longmeadow | 2 |
123 | bush | 2 |
124 | Bon | 2 |
125 | Uncle | 2 |
126 | Kitty Bryant's | 2 |
127 | Johnson | 2 |
128 | Jo\ncarried | 2 |
129 | Jo warmly | 2 |
130 | John\n | 2 |
131 | chasséed | 2 |
132 | Jove | 2 |
133 | Eliott | 2 |
134 | Cornelius | 2 |
135 | Dove | 2 |
136 | Demijohn | 2 |
137 | Aunt Carrol | 2 |
138 | May | 2 |
139 | Killarney | 2 |
140 | Kate Kearney | 2 |
141 | Mees Marsch | 2 |
142 | Mamma | 2 |
143 | Plato | 2 |
144 | chubby | 2 |
145 | Mozart | 2 |
146 | homesick | 2 |
147 | Minna | 2 |
148 | Rob | 2 |
149 | Ted | 2 |
150 | LULU | 2 |
151 | Betty | 2 |
152 | Louisa May Alcott | 2 |
153 | Charles Dickens | 2 |
154 | Lulu | 2 |
155 | Undine | 1 |
156 | Sintram | 1 |
157 | Faber | 1 |
158 | Operatic Tragedy | 1 |
159 | Hurry | 1 |
160 | Meg\nwarmly | 1 |
161 | Presently Beth | 1 |
162 | Die Engel-kinder | 1 |
163 | Santa Claus | 1 |
164 | on,--when | 1 |
165 | Jo! | 1 |
166 | Miss Josephine | 1 |
167 | Christopher | 1 |
168 | Quel | 1 |
169 | pantoufles jolis | 1 |
170 | Laurie\ngood-naturedly | 1 |
171 | Buzz | 1 |
172 | arnica | 1 |
173 | Kings | 1 |
174 | Florence | 1 |
175 | Maria\nParks | 1 |
176 | Belsham]\n\n | 1 |
177 | Ellen | 1 |
178 | Susie Perkins | 1 |
179 | Chloe | 1 |
180 | Tom | 1 |
181 | brown house | 1 |
182 | Theodore\nLaurence | 1 |
183 | Jo arm-in | 1 |
184 | Laurie\nmount guard | 1 |
185 | Beth\nhid | 1 |
186 | James Laurence' | 1 |
187 | Amy March | 1 |
188 | Katy Brown | 1 |
189 | Mary Kingsley | 1 |
190 | Miss Snow | 1 |
191 | Blimber | 1 |
192 | Hem | 1 |
193 | Jo\nappeared | 1 |
194 | JO MEETS APOLLYON | 1 |
195 | Jo\ncrossly | 1 |
196 | Jo\nforgot | 1 |
197 | Mrs M. | 1 |
198 | M. | 1 |
199 | Miss\nBelle | 1 |
200 | Cinderella | 1 |
201 | Fisher | 1 |
202 | Meg]\n\n | 1 |
203 | &c.]\n\n | 1 |
204 | Knights | 1 |
205 | Tis | 1 |
206 | Unmask | 1 |
207 | the P. C. | 1 |
208 | Snowball Pat | 1 |
209 | Snowball | 1 |
210 | Pickwick Hall | 1 |
211 | Hannah Brown | 1 |
212 | BETH BOUNCER | 1 |
213 | Avenger | 1 |
214 | S. P. | 1 |
215 | A. S. | 1 |
216 | T. T. | 1 |
217 | N. W. | 1 |
218 | bona fide | 1 |
219 | Winkle | 1 |
220 | martin-house | 1 |
221 | Weller | 1 |
222 | Jo\nregarded | 1 |
223 | The P. O. | 1 |
224 | Sairy Gamp | 1 |
225 | Katy\nBrown's | 1 |
226 | Flora McFlimsey | 1 |
227 | Boaz | 1 |
228 | Jo hurried | 1 |
229 | Croaker | 1 |
230 | Laurie wrote,-- | 1 |
231 | Kate Vaughn | 1 |
232 | Sunshine | 1 |
233 | Barker | 1 |
234 | Fred; Laurie took Sallie | 1 |
235 | Jo\nangrily | 1 |
236 | Miss\nMarch | 1 |
237 | Count Gustave | 1 |
238 | What's | 1 |
239 | Thankee | 1 |
240 | Bosen | 1 |
241 | Fred, Sallie | 1 |
242 | Laurie to Jo | 1 |
243 | John Bull | 1 |
244 | Jo\nnodded | 1 |
245 | Truth | 1 |
246 | Mary Stuart | 1 |
247 | Meg heartily | 1 |
248 | Grace of Amy | 1 |
249 | Fred and Kate | 1 |
250 | refrain,-- | 1 |
251 | Yes'm | 1 |
252 | La | 1 |
253 | Beth\nmeekly | 1 |
254 | Laurie heartily | 1 |
255 | Ashamed | 1 |
256 | Bent | 1 |
257 | billiard saloon | 1 |
258 | tipsy | 1 |
259 | Angelo | 1 |
260 | Miss Burney | 1 |
261 | Lady Something | 1 |
262 | MARCH | 1 |
263 | Thomas | 1 |
264 | bag,-- | 1 |
265 | Greatheart | 1 |
266 | Meggy | 1 |
267 | Laurie I | 1 |
268 | Kiss | 1 |
269 | MA | 1 |
270 | contradick | 1 |
271 | Chick | 1 |
272 | Hattie King | 1 |
273 | Jo doos | 1 |
274 | eatin sweet | 1 |
275 | Yours Respectful | 1 |
276 | Hannah Mullet | 1 |
277 | Quartermaster Mullett | 1 |
278 | Lion | 1 |
279 | TEDDY | 1 |
280 | Puck | 1 |
281 | Call Meg | 1 |
282 | baker | 1 |
283 | Madam | 1 |
284 | Estelle | 1 |
285 | Mademoiselle | 1 |
286 | Testament | 1 |
287 | Jo\nafterward | 1 |
288 | Allyluyer | 1 |
289 | Amy Curtis | 1 |
290 | Theodore Laurence | 1 |
291 | Kitty Bryant | 1 |
292 | Anni | 1 |
293 | ESTELLE | 1 |
294 | {\n | 1 |
295 | Laurie soberly | 1 |
296 | Sabbath | 1 |
297 | Peggy | 1 |
298 | Caroline Percy | 1 |
299 | Mother | 1 |
300 | Jo\nsoothingly | 1 |
301 | Jo seriously | 1 |
302 | Sam | 1 |
303 | Rambler | 1 |
304 | wrapper,--was | 1 |
305 | Queen Bess | 1 |
306 | Madonna | 1 |
307 | Child | 1 |
308 | bilin | 1 |
309 | Jo\nblundered | 1 |
310 | Cook | 1 |
311 | independent,--so | 1 |
312 | James Laurence | 1 |
313 | Book | 1 |
314 | Sister Jo | 1 |
315 | grace,--a | 1 |
316 | Toodles | 1 |
317 | John begin | 1 |
318 | Parker | 1 |
319 | Gummidge | 1 |
320 | Mark | 1 |
321 | lace | 1 |
322 | Grecian | 1 |
323 | Jupiter Ammon | 1 |
324 | Uncle Carrol | 1 |
325 | Bacchus | 1 |
326 | Juliet | 1 |
327 | Michael Angelo | 1 |
328 | Maria Theresa | 1 |
329 | fête | 1 |
330 | Literary Lessons]\n\n XXVII | 1 |
331 | a People's Course | 1 |
332 | S. L. A. N. G. Northbury | 1 |
333 | Belzoni | 1 |
334 | Aim | 1 |
335 | Allen | 1 |
336 | Martha | 1 |
337 | jell | 1 |
338 | Jack Scott | 1 |
339 | Mantalini | 1 |
340 | Ned Moffat's | 1 |
341 | John dryly | 1 |
342 | Shut | 1 |
343 | Uncle Teddy | 1 |
344 | John Laurence | 1 |
345 | Megs | 1 |
346 | XXIX | 1 |
347 | Shylock | 1 |
348 | Maud | 1 |
349 | satin | 1 |
350 | May Chester's | 1 |
351 | Tom Brown | 1 |
352 | Tommy Chamberlain | 1 |
353 | Tommy | 1 |
354 | ones slip | 1 |
355 | May Chester | 1 |
356 | Lambs | 1 |
357 | do,--took | 1 |
358 | Hayes | 1 |
359 | Miss Jo | 1 |
360 | Lady Bountiful | 1 |
361 | Shun | 1 |
362 | Lennox | 1 |
363 | Ward | 1 |
364 | Robert Lennox's | 1 |
365 | Aunt Mary | 1 |
366 | Route de Roi | 1 |
367 | Noah | 1 |
368 | Fechter | 1 |
369 | Frank Vaughn | 1 |
370 | Beth Frank | 1 |
371 | Vaughns | 1 |
372 | Lawrence | 1 |
373 | Hogarth | 1 |
374 | Fred and Frank | 1 |
375 | parley vooing | 1 |
376 | cafés | 1 |
377 | Marie Antoinette's | 1 |
378 | Charlemagne | 1 |
379 | bijouterie | 1 |
380 | Bois | 1 |
381 | Chaise | 1 |
382 | Berne | 1 |
383 | Ariadne | 1 |
384 | showy | 1 |
385 | Blöndchen | 1 |
386 | Neckar | 1 |
387 | hands,--and | 1 |
388 | Jo said,-- | 1 |
389 | Cock | 1 |
390 | Bonnie Dundee | 1 |
391 | JO'S | 1 |
392 | Mabel | 1 |
393 | Thou shalt | 1 |
394 | Handsome | 1 |
395 | Lager Beer | 1 |
396 | Ursa Major | 1 |
397 | hose | 1 |
398 | Lucifer | 1 |
399 | nargerie_ | 1 |
400 | "P. S. On | 1 |
401 | Franz | 1 |
402 | L. | 1 |
403 | homey | 1 |
404 | Milton | 1 |
405 | Malaprop | 1 |
406 | Nick Bottom | 1 |
407 | Espagne | 1 |
408 | Jacks | 1 |
409 | compensation-- | 1 |
410 | Northbury | 1 |
411 | grammar | 1 |
412 | désillusionée_ | 1 |
413 | Kant | 1 |
414 | Hegel | 1 |
415 | Schiller | 1 |
416 | irresistible,-- | 1 |
417 | Jo stuffed | 1 |
418 | Sherwood | 1 |
419 | Sabbath-school | 1 |
420 | Hail | 1 |
421 | BETH | 1 |
422 | Jo saw\n | 1 |
423 | rebuke Jo | 1 |
424 | me,--busy | 1 |
425 | Victor Emmanuel | 1 |
426 | buff | 1 |
427 | Place Napoleon | 1 |
428 | Que | 1 |
429 | blasé_ | 1 |
430 | Avigdor | 1 |
431 | Corso | 1 |
432 | Schubert | 1 |
433 | mouchoir | 1 |
434 | Junoesque | 1 |
435 | Diana | 1 |
436 | Serene Something | 1 |
437 | Rothschild | 1 |
438 | Lady de Jones | 1 |
439 | satin train | 1 |
440 | Serene Teuton | 1 |
441 | Vladimir | 1 |
442 | Balzac | 1 |
443 | tulle | 1 |
444 | Daisy]\n\n | 1 |
445 | Vive la | 1 |
446 | Babyland | 1 |
447 | Englishman | 1 |
448 | Baby | 1 |
449 | Mornin | 1 |
450 | beseechingly,-- | 1 |
451 | John kissed | 1 |
452 | Brookes | 1 |
453 | Sallie Moffatt | 1 |
454 | Saxon | 1 |
455 | Laurie one | 1 |
456 | Meek | 1 |
457 | Alps | 1 |
458 | Dolce | 1 |
459 | Raphaella | 1 |
460 | Jouvin | 1 |
461 | mon | 1 |
462 | Au | 1 |
463 | glad | 1 |
464 | Aunty Beth | 1 |
465 | Jo\n_ | 1 |
466 | Requiem | 1 |
467 | Beethoven | 1 |
468 | Bach | 1 |
469 | Vevay | 1 |
470 | La Tour | 1 |
471 | Lausanne | 1 |
472 | Rousseau | 1 |
473 | tableau | 1 |
474 | Providence | 1 |
475 | Jo\nglanced | 1 |
476 | t.\n | 1 |
477 | Aunt Priscilla | 1 |
478 | Jo exclaim,-- | 1 |
479 | ab\nlibitum_ | 1 |
480 | Jo hung | 1 |
481 | Aristotle | 1 |
482 | Aunt Amy | 1 |
483 | Aunt Beth | 1 |
484 | Jo\nneglected | 1 |
485 | Jo some | 1 |
486 | Fessor | 1 |
487 | Hoffmann | 1 |
488 | Jo hastily | 1 |
489 | Jo\nbashfully | 1 |
490 | Catherine | 1 |
491 | Jo\ndecidedly | 1 |
492 | menagerie | 1 |
493 | Mother Bhaer | 1 |
494 | Bhaers | 1 |
495 | Cowley | 1 |
496 | Columella | 1 |
497 | Grandma | 1 |
498 | Unlucky Jo' | 1 |
499 | Amy warmly | 1 |
500 | Louisa M. Alcott's | 1 |
501 | Writings | 1 |
502 | Reginald B. Birch | 1 |
503 | Alice Barber Stephens | 1 |
504 | Jessie Willcox Smith | 1 |
505 | Harriet Roosevelt Richards | 1 |
506 | Girls.= | 1 |
507 | JO'S SCRAP-BAG | 1 |
508 | Amy\n\nIllustrated | 1 |
509 | Rose | 1 |
510 | Ben | 1 |
511 | Bob | 1 |
512 | COMIC | 1 |
513 | Foreword | 1 |
514 | Ednah D. Cheney | 1 |
515 | Sol Eytinge | 1 |
516 | Marjorie | 1 |
517 | Ethel | 1 |
518 | ASTOR | 1 |
519 | Aunt Wee | 1 |
520 | Elizabeth L. Gould | 1 |
521 | Little Women | 1 |
522 | underscores | 1 |
523 | Webster | 1 |
524 | Charles\nDickens | 1 |
525 | Augustus Snodgrass | 1 |
526 | Samuel Weller | 1 |
527 | Betsey | 1 |
528 | Tarantula | 1 |
529 | David Copperfield | 1 |
530 | A. M. Barnard | 1 |
531 | Jo's Scrap-Bag | 1 |
532 | Louisa M. Alcott | 1 |
533 | Michael\nHart | 1 |
534 | Project Gutenberg-tm's | 1 |
535 | S.\nFairbanks | 1 |
536 | Gregory B. Newby | 1 |
537 | Michael S. Hart | 1 |
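Notice that the tally contains near-duplicates such as “Jo” and “Jo\n” (the same name with a stray line break attached), along with non-character noise like “Project Gutenberg.” As an optional extra, not part of the original counting step, one light cleaning pass is to normalize whitespace inside each entity string before tallying:

# Optional cleaning sketch: collapse stray newlines and extra spaces inside entity strings
# so that variants like "Jo" and "Jo\n" are counted as the same character
cleaned_people = [" ".join(person.split()) for person in people]
cleaned_tally = Counter(cleaned_people)
pd.DataFrame(cleaned_tally.most_common(10), columns=['character', 'count'])

This won’t catch everything (for example, “Jo decidedly” is still its own entry), but it merges the most obvious variants.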
Get Places#
| Type Label | Description |
|---|---|
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
To extract and count places, we can follow the same model as above, except we will change our if
statement to check for “ent” labels that match “GPE” or “LOC.” These are the type labels for “countries, cities, states” and “non-GPE locations, mountain ranges, bodies of water.”
places = []
for document in chunked_documents:
for named_entity in document.ents:
if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
places.append(named_entity.text)
places_tally = Counter(places)
df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df
place | count | |
---|---|---|
0 | Washington | 13 |
1 | Nice | 12 |
2 | Paris | 10 |
3 | Belle | 9 |
4 | china | 8 |
5 | Rome | 8 |
6 | Tina | 8 |
7 | Demi | 8 |
8 | America | 7 |
9 | Plumfield | 7 |
10 | London | 7 |
11 | Bhaer | 7 |
12 | the United States | 7 |
13 | Hummels | 6 |
14 | Switzerland | 6 |
15 | Laurie | 6 |
16 | Germany | 6 |
17 | Vevay | 5 |
18 | Hum | 5 |
19 | Celestial City | 4 |
20 | Tudor | 4 |
21 | Valley | 3 |
22 | Sancho | 3 |
23 | Italy | 3 |
24 | France | 3 |
25 | New York | 3 |
26 | Berlin | 3 |
27 | Paradise | 3 |
28 | U.S. | 3 |
29 | Europe | 2 |
30 | quaver | 2 |
31 | Egypt | 2 |
32 | Satan | 2 |
33 | Jupiter | 2 |
34 | India | 2 |
35 | earth | 2 |
36 | Lottchen | 2 |
37 | Minna | 2 |
38 | A.M. | 2 |
39 | Baden-Baden | 2 |
40 | Scotts | 2 |
41 | 1.E.8 | 2 |
42 | the United\nStates | 2 |
43 | China | 1 |
44 | South | 1 |
45 | the City of Destruction | 1 |
46 | the Slough of Despond | 1 |
47 | Asia | 1 |
48 | Africa | 1 |
49 | Columbus | 1 |
50 | maroon | 1 |
51 | IV | 1 |
52 | Vicar | 1 |
53 | Kings | 1 |
54 | the Diamond Lake | 1 |
55 | Bremer | 1 |
56 | Moffats | 1 |
57 | Chiny | 1 |
58 | Bacon | 1 |
59 | Milton | 1 |
60 | capital,--so | 1 |
61 | Shadowy | 1 |
62 | Canada | 1 |
63 | Sahara | 1 |
64 | Atalanta | 1 |
65 | Breakfast | 1 |
66 | Pewmonia | 1 |
67 | Rappahannock | 1 |
68 | Heinrich | 1 |
69 | Mentor | 1 |
70 | Thou | 1 |
71 | us | 1 |
72 | Hercules | 1 |
73 | Romeo | 1 |
74 | Lisbon | 1 |
75 | Spiritualism | 1 |
76 | Jo | 1 |
77 | LONDON | 1 |
78 | Liverpool | 1 |
79 | Briton | 1 |
80 | Yankee | 1 |
81 | Kenilworth | 1 |
82 | Regent Street | 1 |
83 | Hyde Park | 1 |
84 | Wellington | 1 |
85 | Punch | 1 |
86 | Hampton | 1 |
87 | Richmond Park | 1 |
88 | Saint Denis | 1 |
89 | the Tuileries Gardens | 1 |
90 | Rhine | 1 |
91 | Bonn | 1 |
92 | Nassau | 1 |
93 | Byronic | 1 |
94 | NEW YORK | 1 |
95 | Corinne | 1 |
96 | Kirke | 1 |
97 | quarrel,--we | 1 |
98 | Sandwich | 1 |
99 | the Jardin Publique | 1 |
100 | Chauvain | 1 |
101 | Greece | 1 |
102 | Tarlatan | 1 |
103 | Cardiglia | 1 |
104 | Tarantula | 1 |
105 | Davises | 1 |
106 | india | 1 |
107 | character,--we | 1 |
108 | Babydom | 1 |
109 | Monaco | 1 |
110 | Baptiste | 1 |
111 | Mediterranean | 1 |
112 | Requiem | 1 |
113 | Vienna | 1 |
114 | Mendelssohn | 1 |
115 | St. Gingolf | 1 |
116 | Clarens | 1 |
117 | Weathercock | 1 |
118 | St. Martin | 1 |
119 | marmar | 1 |
120 | Truly | 1 |
121 | Hamburg | 1 |
122 | Professorin | 1 |
123 | West | 1 |
124 | Pomonas | 1 |
125 | Garland | 1 |
126 | WASHINGTON | 1 |
127 | BOSTON | 1 |
128 | New\n England | 1 |
129 | New Hampshire | 1 |
130 | Washington St. | 1 |
131 | Boston | 1 |
132 | Mass. | 1 |
133 | Passages | 1 |
134 | N. Winkle's | 1 |
135 | Fairfield | 1 |
136 | a United States | 1 |
137 | Replacement | 1 |
138 | Mississippi | 1 |
139 | Salt Lake City | 1 |
Get Streets & Parks#
| Type Label | Description |
|---|---|
| FAC | Buildings, airports, highways, bridges, etc. |
To extract and count buildings, streets, and other facilities, we can follow the same model as above, except we will change our if statement to check for “ent” labels that match “FAC.” This is the type label for “buildings, airports, highways, bridges, etc.”
streets = []
for document in chunked_documents:
for named_entity in document.ents:
if named_entity.label_ == "FAC":
streets.append(named_entity.text)
streets_tally = Counter(streets)
df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df
street | count | |
---|---|---|
0 | Pickwick Hall | 1 |
1 | the Earl of Devereux | 1 |
2 | the Tower of Babel | 1 |
3 | the Barnville Theatre | 1 |
4 | Loved | 1 |
5 | Camp Laurence | 1 |
6 | the moon | 1 |
7 | the\ngate | 1 |
8 | Sphinx | 1 |
9 | the Bath Hotel | 1 |
10 | the Rue de Rivoli | 1 |
11 | Castle Hill | 1 |
12 | Difficulty | 1 |
13 | Camp | 1 |
14 | Page 123 | 1 |
15 | Page 124 | 1 |
16 | Page 411 | 1 |
17 | Page 413 | 1 |
18 | The Children's Friend Series | 1 |
19 | the Project Gutenberg™ License | 1 |
Get Works of Art#
| Type Label | Description |
|---|---|
| WORK_OF_ART | Titles of books, songs, etc. |
To extract and count works of art, we can follow a similar-ish model to the examples above. This time, however, we’re going to make our code even more economical and efficient (while still changing our if
statement to match the “ent” label “WORK_OF_ART”).
# The same extraction as before, condensed into a single list comprehension
works_of_art = [named_entity.text for document in chunked_documents
                for named_entity in document.ents
                if named_entity.label_ == "WORK_OF_ART"]
art_tally = Counter(works_of_art)
df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df
work_of_art | count | |
---|---|---|
0 | Meg | 4 |
1 | Merry Christmas | 3 |
2 | Aunt March | 3 |
3 | Dear me | 3 |
4 | the List of Illustrations | 3 |
5 | the "Heir of Redclyffe | 2 |
6 | Christopher Columbus | 2 |
7 | the "Busy Bee Society | 2 |
8 | Hamlet | 2 |
9 | O Laurie | 2 |
10 | Come, Jo | 2 |
11 | O Teddy | 2 |
12 | Love | 2 |
13 | Aunt Dodo | 2 |
14 | Little Men | 2 |
15 | Eight Cousins | 2 |
16 | Plain Vanilla ASCII | 2 |
17 | Little Tranquillity | 1 |
18 | Laurie Laurence,--what | 1 |
19 | Belsham's Essays by the hour | 1 |
20 | Vicar of Wakefield' | 1 |
21 | 'Yes, ma'am | 1 |
22 | the "Mansion of Bliss | 1 |
23 | Dear Madam_,-- | 1 |
24 | Young ladies, you remember | 1 |
25 | Is Laurie an | 1 |
26 | Keep | 1 |
27 | Daisy March | 1 |
28 | Dear me! | 1 |
29 | Shall I go away | 1 |
30 | Aunt\nCockle-top | 1 |
31 | The Pickwick Portfolio | 1 |
32 | Poet's Corner | 1 |
33 | Lecture on "WOMAN AND HER POSITION | 1 |
34 | The Wide, Wide World | 1 |
35 | Scarlet | 1 |
36 | 'The Sea-Lion | 1 |
37 | Dancing and French | 1 |
38 | Young ladies in America | 1 |
39 | Dear, how charming! | 1 |
40 | Illustration: Swinging to and fro | 1 |
41 | Please | 1 |
42 | Spirits | 1 |
43 | Delectable Mountain | 1 |
44 | Hurrah for Miss March | 1 |
45 | Spread Eagles | 1 |
46 | The Rival Painters | 1 |
47 | Won't Laurie laugh | 1 |
48 | Where's Laurie | 1 |
49 | 'Take it | 1 |
50 | 'Hope | 1 |
51 | A SONG FROM THE SUDS | 1 |
52 | Yours Respectful | 1 |
53 | "Water | 1 |
54 | About Meg | 1 |
55 | Stop, Jo | 1 |
56 | O Meg | 1 |
57 | Where is Laurie | 1 |
58 | Hanged if I do! | 1 |
59 | Hang the 'Rambler | 1 |
60 | So am I | 1 |
61 | Shall I tell you how | 1 |
62 | The Spread Eagle | 1 |
63 | Mother and I | 1 |
64 | Mother and I are going to wait for John | 1 |
65 | Run, Beth, | 1 |
66 | Spread Eagle | 1 |
67 | Send Beth | 1 |
68 | Curse of the Coventrys | 1 |
69 | Receipt\nBook | 1 |
70 | Daisy and Demi,--just the thing | 1 |
71 | Come, Jo, | 1 |
72 | Jo March | 1 |
73 | Yes, Amy was | 1 |
74 | Speak for yourself | 1 |
75 | Teddy | 1 |
76 | Didn't Hayes | 1 |
77 | Everything of Amy's | 1 |
78 | Teddy's Own | 1 |
79 | Aunt Carrol | 1 |
80 | Aunt and Flo | 1 |
81 | 'The Flirtations of Capt | 1 |
82 | Rotten Row means ' | 1 |
83 | The Palais Royale | 1 |
84 | Olympia's Oath | 1 |
85 | Ah, Jo | 1 |
86 | 'Friend of the old | 1 |
87 | Constant\n Tin Soldier | 1 |
88 | A Happy New Year | 1 |
89 | Shall I tell my friend | 1 |
90 | own,--a | 1 |
91 | Mees Marsch | 1 |
92 | Weekly Volcano | 1 |
93 | Demon of the Jura | 1 |
94 | Bless | 1 |
95 | Teddy, dear | 1 |
96 | Dear little bird | 1 |
97 | Wish I was | 1 |
98 | Lazy Laurence | 1 |
99 | Saint Laurence | 1 |
100 | Rarey with Puck | 1 |
101 | O Laurie, Laurie | 1 |
102 | Yes, Laurie | 1 |
103 | the church of\none | 1 |
104 | Dear Jo | 1 |
105 | Dear old f | 1 |
106 | Miss Marsch | 1 |
107 | IN THE GARRET | 1 |
108 | Mother Bhaer | 1 |
109 | Boys and How They Turned Out.= A Sequel | 1 |
110 | The Little Women Series | 1 |
111 | Roses and Forget-me-nots | 1 |
112 | Baa! Baa | 1 |
113 | "How They Ran Away | 1 |
114 | Water-Lilies | 1 |
115 | Shadow-Children | 1 |
116 | The Moss People | 1 |
117 | Little Women | 1 |
118 | Transcriber's | 1 |
119 | The Pickwick Papers | 1 |
120 | Library | 1 |
Get NER in Context#
from IPython.display import Markdown, display
import re
def get_ner_in_context(keyword, document, desired_ner_labels=False):
    # If no specific labels are requested, accept any label the NER component knows about
    if desired_ner_labels == False:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword appears in the entity (ignoring capitalization)
            # and whether the entity's label is one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Use the regex library to replace linebreaks and to bold the matched entity,
                # again ignoring capitalization
                sentence_text = re.sub('\n', ' ', sentence.text)
                sentence_text = re.sub(f"{named_entity.text}", f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)
                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
for document in chunked_documents:
get_ner_in_context('Jupiter', document)
---
LOC
By Jupiter I will, if I only get the chance!” cried Laurie, sitting up with sudden energy.
---
PERSON
A crash, a cry, and a laugh from Laurie, accompanied by the indecorous exclamation, “Jupiter Ammon!
---
LOC
“Twins, by Jupiter!” was all he said for a minute; then, turning to the women with an appealing look that was comically piteous, he added, “Take ‘em quick, somebody!
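The function also accepts an optional desired_ner_labels argument if you only want matches tagged with particular labels. For example (an illustrative call; output not shown):

# Only report sentences where the keyword was tagged as a PERSON
for document in chunked_documents:
    get_ner_in_context('Beethoven', document, desired_ner_labels=['PERSON'])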
Your Turn!#
Now it’s your turn to take a crack at NER with a whole new text!
| Type Label | Description |
|---|---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including “%”. |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |
In this section, you’re going to extract and count named entities from The Autobiography of Benjamin Franklin.
Open and read the text file
filepath = "../texts/literature/The-Autobiography-of-Benjamin-Franklin.txt"
text = open(filepath, encoding='utf-8').read()
To process the book in smaller chunks (if working in Binder or on a computer with memory constraints):
chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))
To process the book all at once (if working on a computer with a larger amount of memory):
document = nlp(text)
1. Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you’ve chosen) in the book. If you need help, study the examples above.
2. What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?
3. What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?
4. What’s an insight that you might be able to glean about the book based on your NER extraction?