# Named Entity Recognition

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

We will be working with the English-language spaCy model in this lesson. However, with the help of Quinn Dombrowski, I am also curating tutorials for NER with other languages:

- [Chinese NER](Multilingual/Chinese/02-Named-Entity-Recognition-Chinese)
- [Danish NER](Multilingual/Danish/02-Named-Entity-Recognition-Danish)
- [Portuguese NER](Multilingual/Portuguese/02-Named-Entity-Recognition-Portuguese)
- [Russian NER](Multilingual/Russian/02-Named-Entity-Recognition-Russian)
- [Spanish NER](Multilingual/Spanish/02-Named-Entity-Recognition-Spanish)

* Please reach out if you're interested in adding another language!

---

## Dataset

### Ada Lovelace's Obituary & Louisa May Alcott's *Little Women*

<blockquote class="epigraph" style=" padding: 10px">

A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843. 

-Claire Cain Miller, ["Ada Lovelace,"](https://www.nytimes.com/interactive/2018/obituaries/overlooked-ada-lovelace.html) *New York Times* Overlooked Obituaries

</blockquote>

**Here's a preview of spaCy's NER tagging Ada Lovelace's obituary:**

---

In [29]:
displacy.render(document, style="ent")

---

## Why is NER Useful?

Named Entity Recognition is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters (something we'll do in a later lesson!). Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping the locations (something we'll also do in a later lesson!).

## Natural Language Processing (NLP)

Named Entity Recognition is a fundamental task in the field of *natural language processing* (NLP). NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is to understand human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called *spellcheck*? How about autocomplete, Google Translate, chatbots, or Siri? These are all examples of NLP in action!

Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts and images are now getting eerily good.

Open-source NLP tools are getting very good, too. We're going to use one of these open-source tools, the Python library `spaCy`, for our Named Entity Recognition tasks in this lesson.

## How spaCy Works

The screenshot above shows spaCy correctly identifying named entities in Ada Lovelace's *New York Times* obituary (something that we'll test out for ourselves below). How does spaCy know that "Ada Lovelace" is a person and that "1843" is a date?

Well, spaCy doesn't *know*, not for sure anyway. Instead, spaCy is making a very educated guess. This "guess" is based on what spaCy has learned about the English language after seeing lots of other examples.

That's a colloquial way of saying: spaCy relies on machine learning models that were trained on a large number of carefully labeled texts. These texts were, in fact, often labeled and corrected by hand. This is similar to our <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Overview.html#1)-LDA-is-an-Unsupervised-Algorithm">topic modeling work</a> from the previous lesson, except our topic model wasn't using labeled data.

The English-language spaCy model that we're going to use in this lesson was trained on an annotated corpus called ["OntoNotes"](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf): 2 million+ words drawn from "news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech," which were meticulously tagged by a group of researchers and professionals for people's names and places, for nouns and verbs, for subjects and objects, and much more. Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.

When spaCy identifies people and places in Ada Lovelace's obituary, in other words, the NLP model is actually making *predictions* about the text based on what it has learned about how people and places function in English-language sentences.

## NER with spaCy

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [7]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also going to import `Counter` from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data. We're also raising the pandas display limits for the maximum number of rows and the maximum column width.
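If `Counter` is new to you, here is a minimal sketch of how it tallies the items in a list. The names in the list are made-up sample data, not output from spaCy.

```python
from collections import Counter

# Made-up sample data standing in for extracted entity names
names = ["Jo", "Amy", "Jo", "Beth", "Jo", "Amy"]

# Counter tallies how many times each item appears
name_tally = Counter(names)

# most_common() returns (item, count) pairs, sorted from most to least frequent
top_names = name_tally.most_common(2)
print(top_names)
```

Passing a number to `most_common()` limits the result to that many items; calling it with no argument returns every item.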

### Download Language Model

Next we need to download the English-language model (`en_core_web_sm`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated "OntoNotes" corpus. You can download the `en_core_web_sm` model by running the cell below:

In [None]:
!python -m spacy download en_core_web_sm

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian*.  

*spaCy offers language and tokenization support for other languages via external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean.*

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [9]:
import en_core_web_sm
nlp = en_core_web_sm.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('en_core_web_sm')

If you just downloaded the model for the first time, it's advisable to use Option 1, which lets you use the model immediately. With Option 2, you'll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read Ada Lovelace's obituary. Then we run `nlp()` on the text and create our document.

In [28]:
filepath = "../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

## spaCy Named Entities

Below is a Named Entities chart for English-language spaCy taken from [its website](https://spacy.io/api/annotation#named-entities). This chart shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's English-language NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style=` parameter set to "ent"  (short for entities):

In [30]:
displacy.render(document, style="ent")

From a quick glance at the text above, we can see that the English-language spaCy model is doing quite well with NER. But it's definitely not perfect.

Though spaCy correctly identifies "Ada Lovelace" as a `PERSON` in the first sentence, just a few sentences later it labels her as a `WORK_OF_ART`. Though spaCy correctly identifies "London" as a place `GPE` a few paragraphs down, it incorrectly identifies "Jacquard" as a place `GPE`, too (when really "Jacquard" is a type of loom, named after [Joseph Marie Jacquard](https://en.wikipedia.org/wiki/Jacquard_machine)).

This inconsistency is very important to note and keep in mind. If we wanted to use spaCy's English-language NER model for a project, it would almost certainly require manual correction and cleaning. And even then it wouldn't be perfect. That's why understanding the limitations of this tool is so crucial. While spaCy's English-language NER can be very good for identifying entities in broad strokes, it can't be relied upon for anything exact and fine-grained — not out of the box anyway.

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from Ada Lovelace's obituary.

In [12]:
document.ents

(first,
 CLAIRE CAIN MILLER,
 A century,
 Ada Lovelace,
 1843,
 Jacquard,
 British,
 Charles Babbage’s,
 Analytical Engine,
 Lovelace,
 1852,
 36,
 first,
 seventh,
 Bernoulli,
 Bernoulli,
 Swiss,
 Jacob Bernoulli,
 Walter Isaacson,
 “The Innovators,
 The Analytical Engine,
 Lovelace,
 British,
 Lord Byron,
 Betty Alexandra Toole,
 Math,
 Lovelace,
 the mid-20th century,
 the Defense Department,
 October,
 Ada Lovelace,
 Lovelace,
 Lady Lovelace,
 The London Examiner,
 Babbage,
 Augusta Ada Byron,
 Dec. 10, 1815,
 London,
 Lord Byron,
 Annabella Milbanke,
 8,
 Lovelace,
 Smith Collection/Gado/Getty Images
 
  Lovelace,
 British,
 the day,
 Mary Somerville,
 Somerville,
 Lovelace,
 17,
 two-foot,
 almost two decades,
 William King,
 Somerville,
 1835,
 19,
 Lovelace,
 1839,
 two,
 Somerville,
 Mathematics,
 Trigonometry,
 Cubic,
 1840,
 Lovelace,
 Augustus De Morgan,
 London,
 first,
 1843,
 27,
 Lovelace,
 the Babbage Analytical Engine,
 nearly three,
 Notes,
 first,
 Ursula Martin,
 t

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [13]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

first ORDINAL
CLAIRE CAIN MILLER PERSON
A century DATE
Ada Lovelace PERSON
1843 DATE
Jacquard PERSON
British NORP
Charles Babbage’s PERSON
Analytical Engine PERSON
Lovelace PERSON
1852 DATE
36 CARDINAL
first ORDINAL
seventh ORDINAL
Bernoulli PERSON
Bernoulli PERSON
Swiss NORP
Jacob Bernoulli PERSON
Walter Isaacson PERSON
“The Innovators WORK_OF_ART
The Analytical Engine WORK_OF_ART
Lovelace PERSON
British NORP
Lord Byron ORG
Betty Alexandra Toole PERSON
Math PERSON
Lovelace PERSON
the mid-20th century DATE
the Defense Department ORG
October DATE
Ada Lovelace PERSON
Lovelace PERSON
Lady Lovelace PERSON
The London Examiner FAC
Babbage ORG
Augusta Ada Byron ORG
Dec. 10, 1815 DATE
London GPE
Lord Byron ORG
Annabella Milbanke ORG
8 DATE
Lovelace PERSON
Smith Collection/Gado/Getty Images

 Lovelace ORG
British NORP
the day DATE
Mary Somerville PERSON
Somerville GPE
Lovelace PERSON
17 DATE
two-foot QUANTITY
almost two decades DATE
William King PERSON
Somerville GPE
1835 DATE
19 DATE
Lovelace 

To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [14]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

CLAIRE CAIN MILLER
Ada Lovelace
Jacquard
Charles Babbage’s
Analytical Engine
Lovelace
Bernoulli
Bernoulli
Jacob Bernoulli
Walter Isaacson
Lovelace
Betty Alexandra Toole
Math
Lovelace
Ada Lovelace
Lovelace
Lady Lovelace
Lovelace
Mary Somerville
Lovelace
William King
Lovelace
Lovelace
Augustus De Morgan
Lovelace
Ursula Martin
Lovelace
Lovelace
Claire Cain Miller
Ada Lovelace


## NER with Long Texts or Many Texts

For the rest of this lesson, we're going to work with a much longer text: Louisa May Alcott's novel *Little Women*. Because the novel is long, we split the text into 80 smaller chunks and process them as a batch with `nlp.pipe()`.

In [15]:
filepath = "../texts/literature/Little-Women_Louisa-May-Alcott.txt"
text = open(filepath, encoding='utf-8').read()

In [16]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [17]:
chunked_documents = list(nlp.pipe(text_chunks))
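The chunking above slices the text at fixed character positions, so a chunk boundary can fall in the middle of a word or a name, and an entity split that way may be missed or mislabeled. Below is a rough sketch of one alternative, which groups whole paragraphs into chunks of up to a target size. The function name `chunk_by_paragraph` and the `max_chars` parameter are my own placeholders, not part of spaCy, and the sketch assumes paragraphs are separated by blank lines.

```python
def chunk_by_paragraph(text, max_chars=10000):
    """Group whole paragraphs into chunks of roughly max_chars characters."""
    # Treat blank lines as paragraph breaks and skip empty paragraphs
    paragraphs = [p for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = ""
    for paragraph in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    # Keep whatever is left over as the final chunk
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
sample_chunks = chunk_by_paragraph(sample, max_chars=40)
print(sample_chunks)
```

A single paragraph longer than `max_chars` is kept whole rather than cut, so chunk sizes are approximate. The resulting list could be passed to `nlp.pipe()` in place of `text_chunks`.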

## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

To extract and count the people, we will use an `if` statement that will pull out words only if their "ent" label matches "PERSON."

:::{admonition} Pandas Review
:class: pandasreview
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
    
:::

In [18]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PERSON":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|:---:|:---:|:---:|
|0|Jo|1256|
|1|Amy|645|
|2|Laurie|570|
|3|Beth|465|
|4|Meg|311|
|5|John|144|
|6|Hannah|122|
|7|Brooke|96|
|8|Laurence|85|
|9|Bhaer|77|


## Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "GPE" or "LOC." These are the type labels for "countries, cities, states" and "locations, mountain ranges, bodies of water."

In [19]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Washington|13|
|1|Nice|12|
|2|Paris|10|
|3|Belle|9|
|4|china|8|
|5|Rome|8|
|6|Tina|8|
|7|Demi|8|
|8|America|7|
|9|Plumfield|7|
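Some of the top "places" in this output look doubtful: "Belle," "Tina," and "Demi" read more like characters' names than locations. That is my reading of the results, not something spaCy reports. One simple way to clean a tally is to drop entries that you have checked by hand. The helper name `drop_entries` and the exclusion set below are illustrative assumptions.

```python
from collections import Counter

def drop_entries(tally, exclusions):
    """Return a new Counter without the entries listed in exclusions."""
    return Counter({item: count for item, count in tally.items()
                    if item not in exclusions})

# A few rows copied from the output above
sample_tally = Counter({"Washington": 13, "Nice": 12, "Paris": 10, "Belle": 9, "Tina": 8})

# Hand-picked entries to exclude (an assumption; verify against the text first)
not_places = {"Belle", "Tina"}

cleaned_tally = drop_entries(sample_tally, not_places)
print(cleaned_tally.most_common())
```

Because the helper returns a new `Counter`, the original tally stays available for comparison.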


## Get Streets & Parks

|Type Label|Description|
|:---:|:---:|
|FAC|Buildings, airports, highways, bridges, etc.|

To extract and count facilities such as buildings, streets, and parks, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "FAC." This is the type label for "buildings, airports, highways, bridges, etc."

In [20]:
streets = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "FAC":
            streets.append(named_entity.text)

streets_tally = Counter(streets)

df = pd.DataFrame(streets_tally.most_common(), columns = ['street', 'count'])
df

| |street|count|
|:---:|:---:|:---:|
|0|Pickwick Hall|1|
|1|the Earl of Devereux|1|
|2|the Tower of Babel|1|
|3|the Barnville Theatre|1|
|4|Loved|1|
|5|Camp Laurence|1|
|6|the moon|1|
|7|the\ngate|1|
|8|Sphinx|1|
|9|the Bath Hotel|1|


## Get Works of Art

|Type Label|Description|
|:---:|:---:|
|WORK_OF_ART|Titles of books, songs, etc.|

To extract and count works of art, we can follow the same model as the examples above, changing our `if` statement to match the "ent" label "WORK_OF_ART."

In [21]:
works_of_art = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "WORK_OF_ART":
            works_of_art.append(named_entity.text)

art_tally = Counter(works_of_art)

df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df

| |work_of_art|count|
|:---:|:---:|:---:|
|0|Meg|4|
|1|Merry Christmas|3|
|2|Aunt March|3|
|3|Dear me|3|
|4|the List of Illustrations|3|
|5|the "Heir of Redclyffe|2|
|6|Christopher Columbus|2|
|7|the "Busy Bee Society|2|
|8|Hamlet|2|
|9|O Laurie|2|
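The people, places, streets, and works-of-art cells all repeat the same loop with a different label. One way to avoid the repetition is a small helper that takes the labels as an argument. This is a sketch, not part of spaCy: `tally_entities` is a hypothetical name, and the stand-in objects below only mimic the `.ents`, `.text`, and `.label_` attributes so that the example runs without a language model.

```python
from collections import Counter
from types import SimpleNamespace

def tally_entities(documents, labels):
    """Count entity texts whose label is in labels, across all documents."""
    return Counter(
        named_entity.text
        for document in documents
        for named_entity in document.ents
        if named_entity.label_ in labels
    )

# Stand-ins that mimic processed spaCy documents (made-up data)
fake_documents = [
    SimpleNamespace(ents=[
        SimpleNamespace(text="Jo", label_="PERSON"),
        SimpleNamespace(text="Paris", label_="GPE"),
    ]),
    SimpleNamespace(ents=[
        SimpleNamespace(text="Jo", label_="PERSON"),
        SimpleNamespace(text="Hamlet", label_="WORK_OF_ART"),
    ]),
]

people_counts = tally_entities(fake_documents, {"PERSON"})
print(people_counts.most_common())
```

With real data, a call might look like `tally_entities(chunked_documents, {"GPE", "LOC"})`, and the result can be passed to `pd.DataFrame` just like the tallies above.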


## Get NER in Context

In [23]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=False):

    # If no labels were requested, use every label the NER pipeline can assign
    if not desired_ner_labels:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-process each sentence on its own to get its named entities
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword appears in the entity (ignoring capitalization)
            # and whether the entity has one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces
                sentence_text = re.sub('\n', ' ', sentence.text)
                # Bold the entity wherever it appears, again ignoring capitalization;
                # re.escape makes regex characters in the entity match literally
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

In [24]:
for document in chunked_documents:
    get_ner_in_context('Jupiter', document)

---


**LOC**

By **Jupiter** I will, if I only get the chance!" cried Laurie, sitting up with sudden energy.

---


**PERSON**

A crash, a cry, and a laugh from Laurie, accompanied by the indecorous exclamation, "**Jupiter Ammon**!

---


**LOC**

"Twins, by **Jupiter**!" was all he said for a minute; then, turning to the women with an appealing look that was comically piteous, he added, "Take 'em quick, somebody!
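The bolding step inside `get_ner_in_context` can be tried on its own. The sketch below isolates it in a hypothetical helper, `bold_keyword`, and wraps the search term in `re.escape` so that punctuation in an entity (a period or a parenthesis, say) is matched literally rather than read as regex syntax.

```python
import re

def bold_keyword(sentence_text, keyword):
    """Wrap each case-insensitive match of keyword in Markdown bold markers."""
    pattern = re.escape(keyword)
    # The lambda reuses the matched text, so the original capitalization is kept
    return re.sub(pattern, lambda match: f"**{match.group(0)}**",
                  sentence_text, flags=re.IGNORECASE)

bolded = bold_keyword('"Twins, by Jupiter!" was all he said.', "jupiter")
print(bolded)
```

The same helper handles entities that contain periods, such as dates like "Dec. 10", without the period being treated as a wildcard.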

## Your Turn!

Now it's your turn to take a crack at NER with a whole new text!


```{toggle}
|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|
```

In this section, you're going to extract and count named entities from *The Autobiography of Benjamin Franklin*.

Open and read the text file:

In [187]:
filepath = "../texts/literature/The-Autobiography-of-Benjamin-Franklin.txt"
text = open(filepath, encoding='utf-8').read()

To process the book in smaller chunks (if working in Binder or on a computer with memory constraints):

In [188]:
chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))

To process the book all at once (if working on a computer with a larger amount of memory):

In [62]:
document = nlp(text)

**1.** Choose a named entity from the possible spaCy named entities listed above. Extract, count, and make a dataframe from the most frequent named entities (of the type that you've chosen) in the book. If you need help, study the examples above.

In [None]:
#Your Code Here 👇 

**2.** What is a result from this NER extraction that conformed to your expectations, that you find obvious or predictable? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**3.** What is a result from this NER extraction that defied your expectations, that you find curious or counterintuitive? Why?

**#**Your answer here. (Double click this cell to type your answer.)

**4.** What's an insight that you might be able to glean about the book based on your NER extraction?

**#**Your answer here. (Double click this cell to type your answer.)